**SoC Blockset™** 

User's Guide



# MATLAB® SIMULINK®



## **How to Contact MathWorks**



Latest news: www.mathworks.com

Sales and services: www.mathworks.com/sales\_and\_services

User community: www.mathworks.com/matlabcentral

Technical support: www.mathworks.com/support/contact\_us

T

Phone: 508-647-7000



The MathWorks, Inc. 1 Apple Hill Drive Natick, MA 01760-2098

SoC Blockset<sup>™</sup> User's Guide

© COPYRIGHT 2019 by The MathWorks, Inc.

The software described in this document is furnished under a license agreement. The software may be used or copied only under the terms of the license agreement. No part of this manual may be photocopied or reproduced in any form without prior written consent from The MathWorks, Inc.

FEDERAL ACQUISITION: This provision applies to all acquisitions of the Program and Documentation by, for, or through the federal government of the United States. By accepting delivery of the Program or Documentation, the government hereby agrees that this software or documentation qualifies as commercial computer software or commercial computer software documentation as such terms are used or defined in FAR 12.212, DFARS Part 227.72, and DFARS 252.227-7014. Accordingly, the terms and conditions of this Agreement and only those rights specified in this Agreement, shall pertain to and govern the use, modification, reproduction, release, performance, display, and disclosure of the Program and Documentation by the federal government (or other entity acquiring for or through the federal government) and shall supersede any conflicting contractual terms or conditions. If this License fails to meet the government's needs or is inconsistent in any respect with federal procurement law, the government agrees to return the Program and Documentation, unused, to The MathWorks, Inc.

#### **Trademarks**

MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See www.mathworks.com/trademarks for a list of additional trademarks. Other product or brand names may be trademarks or registered trademarks of their respective holders.

#### **Patents**

MathWorks products are protected by one or more U.S. patents. Please see www.mathworks.com/patents for more information.

#### **Revision History**

March 2019 Online Only New for Version 1.0 (Release 2019a)
September 2019 Online only Revised for Version 1.1 (Release 2019b)

## Contents

## **Create SoC Models**

| What is Task Execution?  Task Execution Life Cycle  Task and Thread                                | 1-2<br>1-2<br>1-3            |
|----------------------------------------------------------------------------------------------------|------------------------------|
| <b>Event-Driven Tasks</b>                                                                          | 1-4<br>1-4                   |
| Timer-Driven Task                                                                                  | 1-9<br>1-9                   |
| Kernel Latency                                                                                     | 1-12<br>1-13                 |
| Task Duration                                                                                      | 1-16                         |
| Approximation Using Calculated Probability Distribution Specification from Task Manager Input Port | 1-16<br>1-18<br>1-19<br>1-19 |
| Value and Caching of Task Subsystem Signals                                                        | 1-20                         |
| Memory and Register Data Transfers                                                                 | 1-21<br>1-21<br>1-23         |
| AXI4-Stream Interface                                                                              | 1-25<br>1-25<br>1-26         |
| Simplified AXI4 Master Interface                                                                   | 1-28<br>1-28                 |

| Simplified AXI4 Master Protocol - Read Channel | 1-30         |
|------------------------------------------------|--------------|
| AXI4-Stream Video Interface                    | 1-32<br>1-32 |
| Protocol Signals and Timing Diagrams           | 1-33         |
| Models with I/O Device Blocks                  | 1-36         |
| Processor I/O with Memory Model                | 1-36         |
| Processor I/O with Network Device Model        | 1-36         |
| Use Template to Create SoC Model               | 1-37         |
| Create SoC Model Using SoC Blockset Template   | 1-37         |
| Template Structure                             | 1-39         |
| Modify Project                                 | 1-39         |
| HDMI Template                                  | 1-42         |
| Required Products                              | 1-42         |
| Template Structure                             | 1-42         |
| Modify Project                                 | 1-43         |
| Frame Buffer with HDMI Template                | 1-45         |
| Required Products                              | 1-45         |
| Template Structure                             | 1-45         |
| Modify Project                                 | 1-46         |
| Stream from FPGA to Processor Template         | 1-49         |
| Required Products                              | 1-49         |
| Template Structure                             | 1-49         |
| Modify Project                                 | 1-50         |
| SDR Template                                   | 1-53         |
| Required Products                              | 1-53         |
| Template Structure                             | 1-53         |
| Modify Project                                 | 1-54         |
| Considerations for Multiple IPs in FPGA Model  | 1-58         |
| Create an SoC Project Application              | 1-59         |
| Project and Top-Level Model                    | 1-60         |
| Software and Task Management on Processor      | 1-63         |
| Processor Model                                | 1-63         |

| Task Processing                                                                                                                                                                                                                                                                                                                                                                                          | 1-64<br>1-66                                                   |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------|
| User Logic on FPGA Sample Based Model Top Model                                                                                                                                                                                                                                                                                                                                                          | 1-67<br>1-67<br>1-70                                           |
| Memory and Register Channel Connections                                                                                                                                                                                                                                                                                                                                                                  | 1-72<br>1-72<br>1-73                                           |
| Simulation and Analysis                                                                                                                                                                                                                                                                                                                                                                                  | 1-75                                                           |
| Custom Hardware Board Configuration                                                                                                                                                                                                                                                                                                                                                                      | 1-76                                                           |
| Build Error for Rapid Accelerator Mode                                                                                                                                                                                                                                                                                                                                                                   | 1-77                                                           |
|                                                                                                                                                                                                                                                                                                                                                                                                          | ions                                                           |
| Task Overruns and Countermeasures                                                                                                                                                                                                                                                                                                                                                                        | 2-2<br>2-3                                                     |
|                                                                                                                                                                                                                                                                                                                                                                                                          | 2-2                                                            |
| Reduction of Task Execution Interval  Distribution of Tasks Across Multiple Processor Cores                                                                                                                                                                                                                                                                                                              | 2-2<br>2-3<br>2-3                                              |
| Reduction of Task Execution Interval                                                                                                                                                                                                                                                                                                                                                                     | 2-2<br>2-3<br>2-3<br>2-3                                       |
| Reduction of Task Execution Interval                                                                                                                                                                                                                                                                                                                                                                     | 2-2<br>2-3<br>2-3<br>2-3<br>2-7<br>2-8                         |
| Reduction of Task Execution Interval Distribution of Tasks Across Multiple Processor Cores Dropping Overrunning Tasks  Task Execution Playback using Recorded Data  Task Priority and Preemption Preemption of Low Priority Task by High Priority Task  Multicore Execution and Core Visualization Specify the Core for a Task Core Visualization in Simulation Data Inspector                           | 2-2<br>2-3<br>2-3<br>2-7<br>2-8<br>2-8<br>2-12<br>2-12<br>2-12 |
| Reduction of Task Execution Interval Distribution of Tasks Across Multiple Processor Cores Dropping Overrunning Tasks  Task Execution Playback using Recorded Data  Task Priority and Preemption Preemption of Low Priority Task by High Priority Task  Multicore Execution and Core Visualization Specify the Core for a Task Core Visualization in Simulation Data Inspector Multi-Core Task Execution | 2-2<br>2-3<br>2-3<br>2-7<br>2-8<br>2-8<br>2-12<br>2-12<br>2-14 |

| Memory Channel Latency Plots                                                                                                                                                                                                                                                                                                                  |                                                      |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------|
|                                                                                                                                                                                                                                                                                                                                               | 2-21                                                 |
| Memory Controller Latency Plots                                                                                                                                                                                                                                                                                                               | 2-25                                                 |
| Memory Bandwidth Plots                                                                                                                                                                                                                                                                                                                        | 2-29                                                 |
| Memory Burst Plots                                                                                                                                                                                                                                                                                                                            | 2-31                                                 |
| Pichory Burst 110t3                                                                                                                                                                                                                                                                                                                           | 2-31                                                 |
| Simulation Diagnostics                                                                                                                                                                                                                                                                                                                        | 2-34                                                 |
| Buffer and Burst Waveforms                                                                                                                                                                                                                                                                                                                    | 2-34                                                 |
| External Memory Channel Protocols                                                                                                                                                                                                                                                                                                             | 2-39                                                 |
| AXI4 Stream to Software via DMA                                                                                                                                                                                                                                                                                                               | 2-39                                                 |
| AXI4 Stream FIFO                                                                                                                                                                                                                                                                                                                              | 2-39                                                 |
| AXI4 Stream Video FIFO                                                                                                                                                                                                                                                                                                                        | 2-39                                                 |
| AXI4 Stream Video Frame Buffer                                                                                                                                                                                                                                                                                                                | 2-40                                                 |
| AXI4 Stream video Prame Buller                                                                                                                                                                                                                                                                                                                | 2-40                                                 |
| AXI4 Ralidolli Access                                                                                                                                                                                                                                                                                                                         | <b>2-4</b> 0                                         |
| Record Data from Hardware I/O Devices                                                                                                                                                                                                                                                                                                         | 2-42                                                 |
| Process to Record Data                                                                                                                                                                                                                                                                                                                        | 2-42                                                 |
| Use Memory and I/O Device Data in Processor Simulation                                                                                                                                                                                                                                                                                        | 2-43                                                 |
| Event-Driven Task                                                                                                                                                                                                                                                                                                                             | 2-43                                                 |
| 2.0110 211.011 14011                                                                                                                                                                                                                                                                                                                          |                                                      |
| Timer-Driven Task                                                                                                                                                                                                                                                                                                                             | 2-43                                                 |
| Generate Code and Deploy on SoC De                                                                                                                                                                                                                                                                                                            |                                                      |
|                                                                                                                                                                                                                                                                                                                                               |                                                      |
| Generate Code and Deploy on SoC De                                                                                                                                                                                                                                                                                                            | evice<br>3-2                                         |
| Generate Code and Deploy on SoC Description  Supported Third-Party Tools and Hardware                                                                                                                                                                                                                                                         | 3-2<br>3-2                                           |
| Generate Code and Deploy on SoC De                                                                                                                                                                                                                                                                                                            | evice<br>3-2                                         |
| Generate Code and Deploy on SoC Description  Supported Third-Party Tools and Hardware                                                                                                                                                                                                                                                         | 3-2<br>3-2                                           |
| Generate Code and Deploy on SoC Description  Supported Third-Party Tools and Hardware  Third-Party Synthesis Tools and Version Support  Third-Party Support for Software Generation  Supported Xilinx Devices                                                                                                                                 | 3-2<br>3-2<br>3-2<br>3-2                             |
| Generate Code and Deploy on SoC Description  Supported Third-Party Tools and Hardware                                                                                                                                                                                                                                                         | 3-2<br>3-2<br>3-2<br>3-2<br>3-2                      |
| Generate Code and Deploy on SoC Description  Supported Third-Party Tools and Hardware  Third-Party Synthesis Tools and Version Support  Third-Party Support for Software Generation  Supported Xilinx Devices  Supported Intel Devices  SoC Board Support Packages                                                                            | 3-2<br>3-2<br>3-2<br>3-2<br>3-2<br>3-3               |
| Generate Code and Deploy on SoC Description  Supported Third-Party Tools and Hardware  Third-Party Synthesis Tools and Version Support  Third-Party Support for Software Generation  Supported Xilinx Devices  Supported Intel Devices  SoC Board Support Packages  Code Generation of Software Tasks                                         | 3-2<br>3-2<br>3-2<br>3-2<br>3-2<br>3-3               |
| Generate Code and Deploy on SoC Description  Supported Third-Party Tools and Hardware  Third-Party Synthesis Tools and Version Support  Third-Party Support for Software Generation  Supported Xilinx Devices  Supported Intel Devices  SoC Board Support Packages  Code Generation of Software Tasks  Timer-Driven Tasks                     | 3-2<br>3-2<br>3-2<br>3-2<br>3-3<br>3-4<br>3-4        |
| Generate Code and Deploy on SoC Description  Supported Third-Party Tools and Hardware  Third-Party Synthesis Tools and Version Support  Third-Party Support for Software Generation  Supported Xilinx Devices  Supported Intel Devices  SoC Board Support Packages  Code Generation of Software Tasks                                         | 3-2<br>3-2<br>3-2<br>3-2<br>3-2<br>3-3               |
| Generate Code and Deploy on SoC Description  Supported Third-Party Tools and Hardware  Third-Party Synthesis Tools and Version Support  Third-Party Support for Software Generation  Supported Xilinx Devices  Supported Intel Devices  SoC Board Support Packages  Code Generation of Software Tasks  Timer-Driven Tasks                     | 3-2<br>3-2<br>3-2<br>3-2<br>3-3<br>3-4<br>3-4        |
| Generate Code and Deploy on SoC Description  Supported Third-Party Tools and Hardware  Third-Party Synthesis Tools and Version Support  Third-Party Support for Software Generation  Supported Xilinx Devices  Supported Intel Devices  SoC Board Support Packages  Code Generation of Software Tasks  Timer-Driven Tasks  Event-Driven Tasks | 3-2<br>3-2<br>3-2<br>3-2<br>3-3<br>3-4<br>3-4        |
| Generate Code and Deploy on SoC Design  Supported Third-Party Tools and Hardware  Third-Party Synthesis Tools and Version Support  Third-Party Support for Software Generation  Supported Xilinx Devices  Supported Intel Devices  SoC Board Support Packages  Code Generation of Software Tasks  Timer-Driven Tasks  Event-Driven Task       | 3-2<br>3-2<br>3-2<br>3-2<br>3-3<br>3-4<br>3-4<br>3-4 |

|    | Step 3: Prepare Model for Generation                                          | 3-6<br>3-8               |
|----|-------------------------------------------------------------------------------|--------------------------|
|    | Step 5: Select Build Action                                                   | 3-8                      |
|    | Step 6: Validate Model                                                        | 3-9                      |
|    | Step 7: Build Model                                                           | 3-9                      |
|    | Step 8: Connect Hardware                                                      | 3-9                      |
|    | Step 9: Load and Run                                                          | 3-10                     |
| 1. | Analyze Performance on SoC De                                                 | 4-2                      |
|    | Limitations                                                                   | <b>4-3 4-4</b>           |
|    | Kernel Instrumentation Profiler                                               | <b>4-4</b><br><b>4-6</b> |
|    | Profile Task Execution on Processor         Task Profiling with External Mode | 4-7<br>4-7               |
|    | Memory Performance Information from FPGA Execution                            | 4-9                      |
|    | Memory Performance Plots                                                      | 4-11                     |
|    | Burst Waveforms                                                               | 4-18                     |
|    | Configuring and Querying the AXI Interconnect Monitor                         | 4-19                     |
| 5  | Exam                                                                          | ples                     |
|    | Random Access of External Memory                                              | 5-2                      |
|    | Packet-Based ADS-B Transceiver                                                | 5-11                     |
|    | Histogram Equalization Using Video Frame Buffer                               | 5-25                     |
|    | Streaming Data from Hardware to Software                                      | 5-39                     |
|    | Analyze Memory Bandwidth Using Traffic Generators                             | 5-53                     |

| Record I/O Data from SoC Device     | • • • • • • • • • • • • • • • • • • • • | 5-6  |
|-------------------------------------|-----------------------------------------|------|
| Simulate with I/O Data Recorded fro | om SoC Device                           | 5-7  |
| Task Execution                      |                                         | 5-7  |
| Timer-Driven Task                   |                                         | 5-10 |
| Event-Driven Task                   |                                         | 5-10 |
| Hardware-Software Partitioning of a | •                                       | 5-10 |
|                                     | Supported Hard                          | lwai |

## **Create SoC Models**

- "What is Task Execution?" on page 1-2
- "Event-Driven Tasks" on page 1-4
- "Timer-Driven Task" on page 1-9
- "Kernel Latency" on page 1-12
- "Task Duration" on page 1-16
- "Value and Caching of Task Subsystem Signals" on page 1-20
- "Memory and Register Data Transfers" on page 1-21
- "AXI4-Stream Interface" on page 1-25
- "Simplified AXI4 Master Interface" on page 1-28
- "AXI4-Stream Video Interface" on page 1-32
- "Models with I/O Device Blocks" on page 1-36
- "Use Template to Create SoC Model" on page 1-37
- "HDMI Template" on page 1-42
- "Frame Buffer with HDMI Template" on page 1-45
- "Stream from FPGA to Processor Template" on page 1-49
- "SDR Template" on page 1-53
- "Considerations for Multiple IPs in FPGA Model" on page 1-58
- "Create an SoC Project Application" on page 1-59
- "Project and Top-Level Model" on page 1-60
- "Software and Task Management on Processor" on page 1-63
- "User Logic on FPGA" on page 1-67
- "Memory and Register Channel Connections" on page 1-72
- "Simulation and Analysis" on page 1-75
- "Custom Hardware Board Configuration" on page 1-76
- "Build Error for Rapid Accelerator Mode" on page 1-77

## What is Task Execution?

A task is a unit of execution or unit of work in a software application. Typically, task execution in an embedded processor is managed by the operating system (OS). When deployed to the embedded processor, a task corresponds to an OS thread. The SoC Blockset defines the execution life cycle and relation to OS threads as follows.

## **Task Execution Life Cycle**

The life cycle of a task can be divided into five states:

- *Created* The system creates all the tasks when the application starts and immediately moves them to the waiting state.
- Waiting The task waits for the associated trigger signal, such as an OS timer or I/O device. After receiving the trigger signal, the task starts to run. If the task has the highest priority, it enters the running state. Otherwise, the task continues to wait until it becomes the highest priority, triggered task.
- Running The task executes its code. When the code completes execution, the task immediately moves to the waiting state. If a trigger for a higher-priority task occurs, the running task moves to the preempted state.
- *Preempted* The task is preempted and waiting to run. A task runs based on a combination of the task priority and the order the task entered the *Preempted* state. Assuming equal task priorities of all other tasks in *Ready to Resume* state, tasks run based on first-in-first-out (FIFO) ordering.
- *Terminated* Tasks terminate when the application ends.

This figure shows the state diagram of a task execution life cycle for an application using an OS. For simplicity, the terminated state is not shown, but a task can reach the terminated state from any of the other states.



#### Task and Thread

A task is a conceptual unit of work in an algorithm. In an application executing on a device, a task is a section of code that executes in a thread within an operating system (OS). The OS thread determines the state of execution of the task. Within the SoC Blockset, a task specifically refers to the portion of the Simulink® model contained within a rate or function-call subsystem. The trigger signal for that subsystem comes from a Task Manager block. When deployed to hardware, an OS thread uses the task properties. The thread executes the code generated from the subsystem. Conceptually, a Task in simulation is equivalent to a thread in generated code.

## See Also

Task Manager

#### **More About**

- "Timer-Driven Task" on page 1-9
- "Event-Driven Tasks" on page 1-4

## **External Websites**

• Task (computing)

## **Event-Driven Tasks**

Event-driven tasks start executing when triggered by an external event. Events can include internal events, such as memory stream or register writes, or external events, such as receiving a UDP data packet from a network connection. Assuming no other tasks are executing at the time of the event or the task has the highest priority, the event-driven task can respond immediately to the event. The task can then process the received data, and potentially generate other events in the model.

#### Create a Simulink Model with an Event Driven Task

This example shows how to create and configure a Simulink® model to use the event driven task feature of the SoC Blockset.

#### **Create a Software Reference Model**

This section shows how to create a reference model of the software for an SoC application model. The software contains an event driven task subsystem that reacts to receiving UDP packets.

- 1 Create a new blank model.
- In the Simulink editor, add a Function-Call Subsystem block to the model. Connect an Inport block to the input port of the Function-Call Subsystem block. Connect the output port to a Terminator block.
- Add an Asynchronous Task Specification block to the model. On the Block Parameters dialog box, set the **Task priority** to 41.
- 4 Connect the output port of Asynchronous Task Specification block to the function() input of the Function-Call Subsystem block.
- Add an Inport block and open the Block parameters dialog box. On the **Signal**Attributes tab, check **Output function call**. Connect the Inport block to the input port of the Asynchronous Task Specification block.
- **6** Open the Function-Call subsystem model.
- 7 Add a UDP Read block to model. Open the Block Parameters dialog box, set Maximum data length (elements) to 1024 and check Enable event-based execution.
- 8 Connect the Inport block to the UDP Read block **UDP Data** port. Connect the **Data** port to the Outport block. Connect the **Length** port to a Terminator block.

- Open the Configuration Parameters dialog box, select the Solver pane. Set Solver selection > Type to Fixed-step and check Tasking and sample timer options > Higher priority value indicates higher task priority.
- 10 Select the Hardware Implementation pane, set Hardware board to Zedboard.
- 11 Save the model as soc\_task\_createeventdriventask\_software.slx.

The completed model should look similar to the following model.



#### Create the SoC Application Model

This section shows how to create the top level SoC application model that contains the software reference model developed in the previous section.

- 1 Create a new blank model.
- In the Simulink editor, add a Model block. On the Block Parameters dialog box, set **Model name** to soc task createeventdriventask software.slx.
- Add a Task Manager block and open the Block Parameters dialog box. Set the Main
   Type to Event-driven and Main > Priority to 41. Each newly added event-driven task exposes an event message input port on the Task Manager block.
- **4** (Optional) On the **Simulation** tab, you specify the task duration for that task. For more information on setting task duration, see "Task Duration" on page 1-16.
- 5 In the editor, add an IO Data Source block to the model. Open the Block Parameters dialog box and enable **Show event port**.
- 6 Connect the IO Data Source block Event port to the Task Manager and the UDP Data port to the UDP Data Message port on the Model reference block.
- 7 Open the Configuration Parameters dialog box, select the Solver pane. Set Solver selection > Type to Fixed-step and check Tasking and sample timer options > Higher priority value indicates higher task priority.

- 8 Select the **Hardware Implementation** pane, set **Hardware board** to Zedboard.
- **9** Update the diagram, press **Ctrl+D**.
- **10** Save the model as soc\_task\_createeventdriventask\_application.slx.

The completed model should look similar to the following model.



#### Run the Model with Event Driven Task

In the Simulink editor, run the

soc\_task\_createeventdriventask\_application.slx model. When the run completes, open the Simulation Data Inspector (SDI) and select **Task1**. The SDI display shows that **Task1** triggers and executes each time a new UDP packet arrives. Although superficially the task execution appears periodic, this is only a byproduct of the current default settings of the IO Data Source block that generates the event with a time step of 0.1.



## **See Also**

I/O Data Source | Task Manager

## **More About**

- "What is Task Execution?" on page 1-2
- "Timer-Driven Task" on page 1-9

## **Timer-Driven Task**

Timer-driven tasks execute at a periodic rate equal to an integer multiple of the Simulink model fundamental sample time.

To create a timer-driven task, connect the task port of a Task Manager block to a periodic event port on a Model block. Each rate in a Model block generates a unique model periodic event port with the time step for the rate shown on the block icon. In the Model block dialog mask, use the **Schedule rates** parameter to enable model periodic event ports.

**Note** A timer-driven task requires a lower priority than an event-driven task.

#### Create a Simulink Model with an Timer Driven Task

This example shows how to create and configure a Simulink model to use the timer driven task feature of the SoC Blockset.

#### **Create a Software Reference Model**

This section shows how to create a reference model of the software for an SoC application model. The software contains one timer driven task subsystem that reacts to receiving UDP packets.

- 1 Create a new blank model.
- In the Simulink editor, add a Subsystem block to the model. Add a Sine block and connect it to the Subsystem block. Connect the output of the Subsystem block to a Terminator block.
- **3** Open the Function-Call subsystem model.
- 4 Open the Block parameters dialog box of the Inport block, set the **Sample Time** to 0.1.
- 5 In the Simulink editor, open the Configuration Parameters dialog box.
- 6 Select the **Hardware Implementation** pane, set **Hardware board** to Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit.
- 7 Save the model as soc\_task\_createtimerdriventask\_software.slx.

The completed model should look similar to the following model.



#### **Create the SoC Application Model**

This section shows how to create the top level SoC application model that contains the software reference model developed in the previous section.

- **1** Create a new blank model.
- 2 In the Simulink editor, add a Model block and open the Block Parameters dialog box.
- 3 Check Main > Schedule Rates and set Main > Model name to soc task createtimerdriventask software.slx.
- 4 In the editor, add a Task Manager block to the model.
- 5 (Optional) Open the Block Parameters dialog box of the Task Manager block. By default, the task **Type** is Timer-driven with a **Period** of 0.1. On the **Simulation** tab, you specify the task duration for that task. For more information on setting task duration, see Task Duration.
- 6 In the editor, connect the **Task1** port to the **D1[0.1]** port of the Model block.
- 7 Open the Configuration Parameters dialog box, select the Hardware Implementation pane, set Hardware board to Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit.
- 8 Update the diagram, press Ctrl+D.
- **9** Save the model as soc\_task\_createtimerdriventask\_application.slx.

The completed model should look similar to the following model.



#### Run the Model with Timer Driven Task

In the Simulink editor, run the

soc\_task\_createtimerdriventask\_application.slx model. When the run completes, open the Simulation Data Inspector (SDI) and select **Task1**. The SDI display shows that **Task1** triggers each 0.1 time steps.



## See Also

Task Manager

#### **More About**

- "What is Task Execution?" on page 1-2
- "Event-Driven Tasks" on page 1-4

## **Kernel Latency**

In a deployed application, switching between threads requires a finite amount of time depending on the current state of the thread, embedded processor, and OS. *Kernel latency* defines the time required for the operating system to respond to a trigger signal, stop execution of any running threads, and start the execution of the thread responsible for the trigger signal.

SoC Blockset models simulate *Kernel latency* as a delay at the start of execution of a task the first time the task moves from the waiting to running state. The following diagram shows the execution timing of a high-priority and low-priority task on a system that simulates a single processor core.



Other factors affecting kernel latency, such as context switch times, can be considered negligible compared to other effects and are not modeled in simulation.

**Note** Kernel latency requires advanced knowledge of the processor specifications and can be generally set to 0 without impact to the simulation.

## **Effect Kernel Latency on Task Execution**

This example shows the effect of kernel latency on the behavior and timing of two timer driven tasks in an SoC application.

The following model simulates a software application with two timer driven tasks. The task characteristics, specified in the Task Manager block, are as follows:

| Name             | Period | Mean Duration |
|------------------|--------|---------------|
| HighPriorityTask | 0.01   | 0.004         |
| LowPriorityTask  | 0.03   | 0.007         |

With these timing conditions, the high priority task preempts the low priority task. In the model Configuration Parameters dialog box, the **Hardware Implementation** > **Operating system/scheduler** > **Kernel latency** is set to 0.002.



Run the model and open the Simulation Data Inspector (SDI) display. Selecting the two task signal produces the following display.



Inspecting the SDI display, a change in task state from *Waiting* to *Running* shows a latency of 0.002 seconds. However, when the task changes from *Preempted* to *Running*, no latency occurs. This timing matches with the expected behavior of task, experiencing a latency in startup of that task execution instance, but not when the task instance already exists.

## **See Also**

Task Manager

#### **More About**

- "What is Task Execution?" on page 1-2
- "Task Duration" on page 1-16

## **Task Duration**

The total time an instance of a task spends in the running state defines the task duration. Task duration can vary due to multiple sources, in particular:

- Conditional branching in the task algorithm
- Dependence on signal values from other tasks
- Dependence on signals from external sources, such as I/O devices or hardware user logic
- Compiler settings and SoC device processor architecture

As a result, task duration for any given task instance can be nondeterministic.

The Task Manager block provides four ways to simulate the nondeterministic task duration: approximation using a parameterized probability distribution, approximation using a calculated probability distribution, and replay of recorded task execution timing data.

## **Approximation Using Parameterized Probability Distribution**

In simulation, the Task Manager block can define the task duration as random variable expressed as the weighted sum of truncated normal distributions. For example, this diagram shows the probability distribution of a task that executes with a short task duration, but can occasionally execute with a longer durations.



To create a probability distribution for a task duration, first open the Task Manager block dialog. Then, on the **Simulation** tab, set **Specify task duration via:** to Dialog. In the **Task duration settings** section, you can set the properties of each distribution by editing the text of that property. You can also add and delete probability distributions from the sum of distributions by clicking the **Add** and **Delete** buttons, respectively.



#### Note

- The sum of the Percent weights must equal 100.
- Each task can use a maximum of 5 distributions.

## **Approximation Using Calculated Probability Distribution**

Each recording of task execution data, either from a previous simulation or from execution on an SoC device, generates several profiling files. The metadata.csv file contains the calculated mean and standard deviation for each task in that recording. To configure a task in the Task Manager block to use the derived statistical data for task duration, follow these steps:

- 1 Open the Task Manager block dialog mask.
- 2 On the Simulation tab, set Specify task duration via to Recorded task diagnostics file.

- 3 Specify the location and name of the metadata.csv file. The **Mean** and **Deviation** parameters are automatically updated with the data from the file.
- 4 Click OK.

## **Specification from Task Manager Input Port**

An input port on the Task Manager block dynamically specifies the task duration. To expose this task duration input port, follow these steps:

- 1 Open the Task Manager block dialog mask.
- 2 On the **Simulation** tab, set **Specify task duration via** to **Input** port.
- 3 Click **OK** to expose a new input port, named **TaskNameDur**, on the block.

## Replay of Recorded Task Execution Timing Data

A data file provides exact task duration for each task execution instance. A task execution data file can come from a previous or independent model simulation or directly from the task execution on a processor in an SoC device. For more information on replaying recorded task execution timing data, see "Task Execution Playback using Recorded Data" on page 2-7.

## See Also

Task Manager

#### **More About**

- "What is Task Execution?" on page 1-2
- "Task Execution Playback using Recorded Data" on page 2-7

#### **External Websites**

Truncated Normal Distribution

## Value and Caching of Task Subsystem Signals

In SoC Blockset, a task subsystem can be treated as an independent model with the task duration simulating the expected execution time on an SoC device. When the Task Manager block executes a task, input signals connected to that task subsystem can either be sampled and cached at the start of the task execution or sampled at the end of the task execution instance. The task subsystem then executes using either the cached or latest value. The value of signals and buses output from the subsystem change at the end of the task execution instance.

To enable task subsystem input signal caching, first open the Simulink configuration parameters on the processor reference model. On the **Hardware Implementation** pane, select **Hardware board settings** > **Task and memory simulation** > **Cache input data at task start**.

## See Also

Task Manager

#### **More About**

- "What is Task Execution?" on page 1-2
- "Task Duration" on page 1-16
- "Kernel Latency" on page 1-12

## **Memory and Register Data Transfers**

An SoC application is composed of one or more algorithms. When an algorithm transfers data to another algorithm, the data is represented as a signal line in Simulink. For behavioral models, the data transfer is instantaneous.

This diagram shows a behavioral model of a datapath between two algorithms.



In the physical world, the algorithms can be on two separate devices, and data transfers do not happen instantaneously. Furthermore, the algorithms can run at different rates, and therefore require buffering and control logic for handshaking. For example, a simple handshake such as "data is valid" from the producer of the data and "ready to accept data" from the consumer serve as control logic.

If one processing element executes in an FPGA or ASIC, and the next processing element executes on an embedded processor, then a simple signal line represents more than just a complex hardware datapath. The data transfer also represents a processor interrupt handler, an operating system task scheduler, and a software driver stack.

In SoC Blockset, you model data transfers and handshake protocols through shared memory. Use a Memory Channel block for external memory or a Register Channel block for registers.

## **Modeling Datapath with Memory Channel Block**

The Memory Channel block represents an abstraction to a complex datapath through external memory and supports different handshake protocols. It facilitates a refinement of the communication between processing elements from an instantaneous, protocol-less wire to a full direct memory access (DMA) connection between a processor and an FPGA.

By adding a Memory Channel block, you can model data movement from one part of the algorithm to another.



The block provides a model of the communication pipeline. The channel also provides a signaling interface.



The interface protocol depends on where the processing is executed. An FPGA or ASIC algorithm can perform data transfers by using standard protocols such as AXI4-Stream or AXI4. An embedded CPU algorithm can use a driver-interface exported to the user space.

This figure shows a model of the datapath from an FPGA algorithm streaming data to a processor algorithm.



Other Memory Channel type selections model additional common datapaths through external memory. For more information about Memory Channel configurations, see Memory Channel.

The writer and reader are connected to the memory and request access to the external memory from a memory controller. For more information about the Memory Controller block, see Memory Controller.

## **Modeling Datapath with Register Channel Block**

The Register Channel block represents the serialization of the processor reads or writes through a common configuration bus such as AXI-Lite.

The Register Channel block provides a timing model for the transfer of register values between processor and hardware algorithms through a common configuration bus. Use this block when the processor writes a command or configuration register or when the processor reads a status register.

## See Also

Memory Channel | Register Channel

## **More About**

• "External Memory Channel Protocols" on page 2-39

## **AXI4-Stream Interface**

Using SoC Blockset, you can model a simplified, streaming protocol in your model. Use HDL Coder $^{\text{\tiny TM}}$  to generate AXI4-Stream interfaces in the IP core.

## **Simplified Streaming Protocol**

When you want to generate an AXI4-Stream interface in your IP core, in your DUT interface, implement the following signals:

- Data
- Valid

When you map scalar DUT ports to an AXI4-Stream interface, you can optionally model the following signals and map them to the AXI4-Stream interface:

- · Ready
- Other protocol signals, such as:
  - TSTRB
  - TKEEP
  - TLAST
  - TID
  - TDEST
  - TUSER

## **Data and Valid Signals**

When the Data signal is valid, the Valid signal is asserted.



## **Ready Signal (Optional)**

The AXI4-Stream interfaces in your DUT can optionally include a Ready signal. In a Slave interface, the Ready signal enables you to apply back pressure. In a Master interface, the Ready signal enables you to respond to back pressure.

If you model the Ready signal in your AXI4-Stream interfaces, your Master interface ignores the Data and Valid signals one clock cycle after the Ready signal is deasserted. You can start sending Data and Valid signals once the Ready signal is asserted. You can send one more Data and Valid signal after the Ready signal is deasserted.

If you do not model the Ready signal, HDL Coder generates the signal and the associated back pressure logic.



For example, if you have a FIFO in your DUT to store a frame of data, to apply back pressure to the upstream component, you can model the Ready signal based on the FIFO Full signal.



## See Also

Memory Channel | SoC Bus Creator

## **More About**

- "External Memory Channel Protocols" on page 2-39
- "Simplified AXI4 Master Interface" on page 1-28
- "AXI4-Stream Video Interface" on page 1-32

## **Simplified AXI4 Master Interface**

#### In this section...

"Simplified AXI4 Master Protocol - Write Channel" on page 1-28

"Simplified AXI4 Master Protocol - Read Channel" on page 1-30

For designs that require accessing large data sets from an external memory, model your algorithm with a simplified AXI4 Master protocol. When you run the IP Core Generation workflow, HDL Codergenerates an IP core with AXI4 Master interfaces. The AXI4 Master interface can communicate between your design and the external memory controller IP by using the AXI4 Master protocol.

## **Simplified AXI4 Master Protocol - Write Channel**

You can use the simplified AXI4 Master protocol to map to AXI4 Master interfaces. Use the simplified AXI4 Master write protocol for a write transaction and the simplified AXI4 Master read protocol for a read transaction.

This figure shows the timing diagram for the signals that you model at the DUT input and output interfaces for an AXI4 Master write transaction.



The DUT waits for wr\_ready to become high to initiate a write request. When wr\_ready becomes high, the DUT can send out the write request. The write request consists of the Data and Write Master to Slave bus signals. This bus consists of wr\_len, wr\_addr, and wr\_valid. wr\_addr specifies the starting address that DUT wants to write to. The wr\_len signal corresponds to the number of data elements in this write transaction. Data can be sent as long as wr\_valid is high. When wr\_ready becomes low, the DUT must stop sending data within one clock cycle, and the Data signal becomes invalid. If the DUT continues to send data after one clock cycle, the data is ignored.

#### **Output Signals**

Model the Data and Write Master to Slave bus signals at the DUT output interface.

- Data: The data that you want to transfer, valid each cycle of the transaction.
- Write Master to Slave bus that consists of:
  - wr\_addr: Starting address of the write transaction that is sampled at the first cycle of the transaction.
  - wr\_len: The number of data values that you want to transfer, sampled at the first cycle of the transaction.
  - wr\_valid: When this control signal becomes high, it indicates that the Data signal sampled at the output is valid.

#### **Input Signals**

Model the Write Slave to Master bus that consists of:

- wr\_ready: This signal corresponds to the backpressure from the slave IP core or
  external memory. When this control signal goes high, it indicates that data can be
  sent. When wr\_ready is low, the DUT must stop sending data within one clock cycle.
  You can also use the wr\_ready signal to determine whether the DUT can send a
  second burst signal immediately after the first burst signal has been sent. Multiple
  burst signals are supported, which means that the wr\_ready signal remains high to
  accept the second burst immediately after the last element of the first burst has been
  accepted.
- wr\_bvalid (optional signal): Response signal from the slave IP core that you can use
  for diagnosis purposes. The wr\_bvalid signal becomes high after the AXI4
  interconnect accepts each burst transaction. If wr\_len is greater than 256, the AXI4
  Master write module splits the large burst signal into 256-sized bursts. wr\_bvalid
  becomes high for each 256-sized burst.
- wr\_bresp (optional signal): Response signal from the slave IP core that you can use for diagnosis purposes. Use this signal with the wr\_bvalid signal.
- wr\_complete (optional signal): Control signal that when remains high for one clock cycle indicates that the write transaction has completed. This signal asserts at the last wr\_bvalid of the burst.

### Simplified AXI4 Master Protocol - Read Channel

This figure shows the timing diagram for the signals that you model at the DUT input and output interfaces for an AXI4 Master read transaction. These signals include the Data, Read Master to Slave Bus, and Read Slave to Master Bus.



The DUT waits for rd\_aready to become high to initiate a read request. When rd\_aready is high, the DUT can send out the read request. The read request consists of the rd\_addr, rd\_len, and rd\_avalid signals of the Read Master to Slave bus. The slave IP or the external memory responds to the read request by sending the Data at each clock cycle. The rd\_len signal corresponds to the number of data values to read. The DUT can receive Data as long as rd\_dvalid is high.

#### Read Request

To model a read request, at the DUT output interface, model the Read Master to Slave bus that consists of:

- rd\_addr: Starting address for the read transaction that is sampled at the first cycle of the transaction.
- rd\_len: The number of data values that you want to read, sampled at the first cycle of the transaction.
- rd\_avalid: Control signal that specifies whether the read request is valid.

At the DUT input interface, implement the rd\_aready signal. This signal is part of the Read Slave to Master bus and indicates when to accept read requests. You can monitor the rd\_aready signal to determine whether the DUT can send consecutive burst requests. When rd\_aready becomes high, it indicates that the DUT can send a read request in the next clock cycle.

#### **Read Response**

At the DUT input interface, model the Data and Read Slave to Master bus signals.

- Data: The data that is returned from the read request.
- Read Slave to Master bus that consists of:
  - rd\_dvalid: Control signal which indicates that the Data returned from the read request is valid.
  - rd\_rvalid (optional signal): response signal from the slave IP core that you can use for diagnosis purposes.
  - rd\_rresp (optional signal): Response signal from the slave IP core that indicates the status of the read transaction.

At the DUT output interface, you can optionally implement the rd\_dready signal. This signal is part of the Read Master to Slave bus and indicates when the DUT can start accepting data. By default, if you do not map this signal to the AXI4 Master read interface, the generated HDL IP core ties rd\_dready to logic high.

#### See Also

Memory Channel | SoC Bus Creator

#### **More About**

- "External Memory Channel Protocols" on page 2-39
- "AXI4-Stream Interface" on page 1-25
- "AXI4-Stream Video Interface" on page 1-32

### **AXI4-Stream Video Interface**

#### In this section...

"Streaming Pixel Protocol" on page 1-32

"Protocol Signals and Timing Diagrams" on page 1-33

Using SoC Blockset, you can implement a simplified, streaming pixel protocol in your model. Use HDL Coder to generate an HDL IP core with AXI4-Stream Video interfaces.

### **Streaming Pixel Protocol**

You can use the streaming pixel protocol for AXI4-Stream Video interface mapping. Video algorithms process data serially and generate video data as a serial stream of pixel data and control signals.

To generate an IP core with AXI4-Stream Video interfaces, in your DUT interface, implement these signals:

- Pixel Data
- Pixel Control Bus

The **Pixel Control Bus** is a bus that has these signals:

- hStart
- hEnd
- vStart
- vEnd
- valid

The signals **hStart** and **hEnd** represent the start of an active line and the end of an active line respectively. The signals **vStart** and **vEnd** represent the start of a frame and the end of a frame.

You can optionally model the backpressure signal, **Ready**, and map it to the AXI4-Stream Video interface.

### **Protocol Signals and Timing Diagrams**

This figure is a 2-by-3 pixel image. The active image area is the rectangle with a dashed line around it and the inactive pixels that surround it. The pixels are labeled with their grayscale values.



#### **Pixel Data and Pixel Control Bus**

This figure shows the timing diagram for the **Pixel Data** and **Pixel Control Bus** signals that you model at the DUT interface.



The **Pixel Data** signal is the primary video signal that is transferred across the AXI4-Stream Video interface. When the **Pixel Data** signal is valid, the **valid** signal is asserted.

The hStart signal becomes high at the start of the active lines. The hEnd signal becomes high at the end of the active lines.

The **vStart** signal becomes high at the start of the active frame in the second line. The **vEnd** signal becomes high at the end of the active frame in the third line.

#### **Optional Ready Signal**

This figure shows the timing diagram for the **Pixel Data**, the **Pixel Control Bus**, and the **Ready** signal that you model at the DUT interface.



When you map the DUT ports to an AXI4-Stream Video interface, you can optionally model the backpressure signal, **Ready**, and map it to the AXI4-Stream Video interface.

In a Slave interface, with the **Ready** signal, you can apply back pressure. In a Master interface, with the **Ready** signal, you can respond to back pressure.

If you model the **Ready** signal in your AXI4-Stream Video interfaces, your Master interface must deassert its **valid** signal one cycle after the **Ready** signal is deasserted.

If you do not model the **Ready** signal, HDL Coder generates the associated backpressure logic.

### See Also

Memory Channel | SoC Bus Creator

### **More About**

- "External Memory Channel Protocols" on page 2-39
- "Simplified AXI4 Master Interface" on page 1-28
- "AXI4-Stream Interface" on page 1-25

## Models with I/O Device Blocks

The **Processor I/O** sublibrary in the SoC Blockset contains blocks that simulate the data transfer between the processor system and other I/O devices in the SoC application. **Processor I/O** blocks, including the Register Read, Register Write, and Stream Read, can read and write data to internal I/O, such as DDR memory or registers, on the SoC device. Similarly, the TCP Read, TCP Write, UDP Read, and UDP Write blocks can read and write data to external communication I/O devices.

### **Processor I/O with Memory Model**

**Processor I/O** blocks, including the Register Read, Register Write, and Stream Read, simulate reading and writing of data to registers and shared memory.

#### Processor I/O with Network Device Model

**Processor I/O** blocks, including the TCP Read, TCP Write, UDP Read, and UDP Write blocks, simulate the exchange of data to external communication protocols supported by an SoC device.

### See Also

# **Use Template to Create SoC Model**

SoC Blockset model templates provide design patterns and best practices for models intended for simulation, HDL code generation, or SoC deployment. Models created from any one of SoC Blockset model templates have their configuration parameters set up for simulation and code generation.

### **Create SoC Model Using SoC Blockset Template**

To efficiently model hardware for SoC design, create a project by using an SoC Blockset template.

- In the MATLAB® Home tab, select the **Simulink** button. Alternatively, at the command line, enter:
  - simulink
- On the Simulink Start Page, scroll down to the SoC Blockset section, which contains links to SoC templates for common workflows. Select a template and save the project. A project folder opens in your workspace containing:
  - A model with the name soc\_\*\_top.slx The top-level model for the SoC project.
  - referencedmodels A folder containing the models referenced from the top model. Some templates include an FPGA model and a processor model. Other templates only include one referenced model: an FPGA model or a processor model.
  - utilities A folder containing utility functions or testbench data used by the model.

To open the top-level model in Simulink, on the **Project Shortcuts** tab, click **Open Top model**.



3 In each template, navigate to the blocks marked **FPGA Algorithm** in the FPGA model, or **Processor Algorithm** in the processor model. These blocks are highlighted for easy detection. Replace the template blocks with your own algorithm model.

**Tip** To easily find the algorithm blocks, follow the annotations throughout the model hierarchy.

4 To open the **SoC Blockset** Block Library, select the Library Browser button, then select **SoC Blockset** in the left pane. Alternatively, at the command line, enter:

soclib

This library includes blocks for creating SoC models and testbenches.

#### **Template Structure**

The top model in an SoC Blockset template includes an FPGA subsystem, which represents the logic intended to program the FPGA. The FPGA subsystem includes two Simulink model variants:

- Frame-based FPGA model For enhanced simulation performance
- Sample-based FPGA model For cycle accuracy and code generation

The top model also includes a processor subsystem, which represents the software program intended to run on the SoC processor. Both the FPGA and the top subsystems contain blocks marked as FPGA Algorithm or Processor Algorithm. Replace these algorithms with your own logic. The top model of the template also includes a memory system, with a memory controller and memory channels. These blocks represent the physical memory system on the board. The model often includes a register channel (to enable communication between the processor and FPGA), testbench, or I/O blocks.

### **Modify Project**

#### **Modify the FPGA Model**

From the top model, open the FPGA model by clicking the arrow at the bottom left of the FPGA block:



The FPGA model contains two model variants: a frame-based variant and a sample-based variant. Double-click the model variant you want to modify. The FPGA model typically includes two main subsystems for you to modify:

- FPGA Algorithm Wrapper Double-click to open the model. The algorithm wrapper
  contains a green-highlighted subsystem named FPGA Algorithm. This block has two
  inputs and one output and is implemented as a multiplier. Replace this block with your
  own FPGA algorithm. Add inputs and outputs as required.
- Test Source Wrapper This block includes a test source and is intended to generate stimulus as an input to the FPGA algorithm. Modify the test source to your needs, or replace it with an alternative source block. If the input to your FPGA algorithm is routed from an I/O block, such as HDMI or SDR, consider using a specific application template.

**Note** Not all templates include a Test Source block in the FPGA model.

#### **Modify the Processor Model**

The processor model includes a Task Manager block and a processor wrapper. The template implements the processor algorithm as a "pass through" wire. Open the processor algorithm wrapper, and replace the Processor Algorithm block (highlighted in blue) with your desired algorithm.

#### **Modify the Register Channel**

The top model of a template also includes a register channel to communicate between the processor and the FPGA model. Use the register channel to configure the FPGA mode,l or to read and check status registers. The Register Channel block in the template includes one register. To add additional registers you must modify the register channel block parameters, the FPGA algorithm, and the processor algorithm:

- Add registers to the register channel Double-click the Register Channel block to open the block mask and add additional registers as needed. Adding registers creates additional ports on the Register Channel block. For additional information, see Register Channel.
- Add ports to the processor model Navigate to the Processor Algorithm Wrapper model. To navigate to the processor model, click **Open Processor model** on the **Project Shortcuts** tab. Double-click Processor Algorithm Wrapper to modify it.

For write registers, add an output port to the module and add logic to drive a value to the added output port. For read registers, add an input port and logic to process the

- information returned from a read register. From the top model, wire the port to the Register Channel block.
- 3 Add ports to the FPGA model Navigate to the FPGA Algorithm Wrapper model. To navigate to the FPGA/Frame based processing model, click Open FPGA sample model on the Project Shortcuts tab. Double-click FPGA Algorithm Wrapper to modify it.
  - For write registers, add an input port to the module and logic to process the information returned from a read register. For read registers, add an output port and logic to drive a value to the added output port.
  - For equivalent behavior when using a Simulink sample-based variant, repeat this step for the sample-based processing model in the FPGA wrapper.
- **4** From the top model, wire the new port to the Register Channel block.

### See Also

Memory Controller | Memory Channel | Register Channel | Task Manager

#### **More About**

- "Stream from FPGA to Processor Template" on page 1-49
- "SDR Template" on page 1-53
- "HDMI Template" on page 1-42
- "Frame Buffer with HDMI Template" on page 1-45

# **HDMI Template**

The High-Definition Multimedia Interface (HDMI) template provides a simulation model for SoC video streaming using SoC Blockset Support Package for Xilinx® Devices. Use this template to simulate and analyze the effects of internal and external connectivity, such as HDMI I/O behavior on a vision processing algorithm.



#### **Required Products**

- Computer Vision Toolbox<sup>™</sup>
- Vision HDL Toolbox<sup>™</sup>
- SoC Blockset Support Package for Xilinx Devices

### **Template Structure**

HDMI video streams from an HDMI Rx block into the FPGA, which implements a video data processing algorithm. The processed images stream to the HDMI Tx block.

FPGA pixel model uses VideoStream Connector blocks to connect different subsystems and to connect to the HDMI I/O blocks. VideoStream Connector is required to generate each subsystem as a separate IP in the implemented reference design from the model. Since the FPGA frame model is for simulation purposes only and is not used for implementation, the Video stream connector blocks are not modeled.

### **Modify Project**

In MATLAB, on the **Project Shortcuts** tab, click **Open FPGA pixel model**. Open the FPGA Algorithm Wrapper, as shown highlighted in green.



The FPGA Algorithm, also highlighted in green, contains feedthrough ports and signals.



You can modify the content of the FPGA algorithm model to incorporate your desired vision processing algorithm, with complete simulation and code generation of the surrounding video memory system. For pure algorithm design and investigation, click **Open FPGA frame model** in the **Project Shortcuts** tab, and repeat this step.

### See Also

"Use Template to Create SoC Model" on page 1-37 | "Create a New Project Using Templates" (Simulink)

### **More About**

• "What Are Projects?" (Simulink)

# Frame Buffer with HDMI Template

The Frame Buffer with High-Definition Multimedia Interface (HDMI) template creates a Simulink project with models to simulate and generate a video application with external memory frame buffer. This template forms the base for the "Histogram Equalization Using Video Frame Buffer" on page 5-25 example. Use this template to simulate the full reference design of a video processing application on an FPGA with HDMI I/O and connection to an external memory frame buffer for advanced image processing designs.



### **Required Products**

- Vision HDL Toolbox
- Computer Vision Toolbox
- SoC Blockset Support Package for Xilinx Devices

#### **Template Structure**

HDMI video streams video data from an HDMI Rx block into the FPGA. The FPGA implements a color-space transformation and your image processing algorithm. The processed images then undergo the inverse color-space transformation and stream to the

HDMI Tx block. The FPGA algorithm is connected to the external memory frame buffer Memory Channel block configured in AXI4-Stream Video Frame Buffer mode.

The FPGA pixel model uses Video Stream Connector blocks to connect different subsystems and to connect to HDMI I/O blocks. This is required to be able to generate each subsystem as a separate IP in the implemented reference design from the model. Since the FPGA frame model is for simulation purposes only and is not used for implementation, the Video Stream Connector blocks are not modeled.

### **Modify Project**

In MATLAB, on the **Project Shortcuts** tab, click **Open FPGA pixel model**. Double-click to open the FPGA Algorithm Wrapper.



The FPGA Algorithm, highlighted in green, contains feedthrough ports and signals.



Modify the content of the FPGA Algorithm subsystem to incorporate your desired vision processing algorithm, with complete simulation and code generation of the surrounding video memory system. The **pixelToFrameBuf** and **pixelFromFrameBuf** ports provide access to the external memory channel, Frame Buffer. For pure algorithm design and investigation, in the **Project Shortcuts** tab, click **Open FPGA frame model**, and repeat this step.

#### See Also

"Use Template to Create SoC Model" on page 1-37 | "Create a New Project Using Templates" (Simulink)

#### **More About**

"What Are Projects?" (Simulink)

• "Histogram Equalization Using Video Frame Buffer" on page 5-25

# Stream from FPGA to Processor Template

Use the *Stream from FPGA to Processor* template to create an SoC Blockset model for designing a datapath from hardware (FPGA) to software (Processor). To create a project using the "Stream to Processor" template, follow the steps to "Create SoC Model Using SoC Blockset Template" on page 1-37.



### **Required Products**

For sample-based processing, no additional products are required.

For frame-based processing, DSP System Toolbox $^{\text{\tiny TM}}$  is required.

### **Template Structure**

This template models a counter as the test data source and minimal logic for the FPGA and processor algorithms. Use this template as a guide and replace the FPGA algorithm and Processor algorithm with your own functionality. The FPGA algorithm is a simple multiplication performed on input data from the test source and from a **configReg** 

parameter. The processor writes the **configReg**. This parameter is modeled using the Register Channel block. Data from the FPGA algorithm is passed to the processor through a Memory Channel block. The memory **Channel Type** parameter is set to AXI4-Stream to Software via DMA, which models the DMA data transfer through shared external memory.

The processor reads the computed data from the memory and performs additional computing, which is implemented in the template as a pass-through wire. You can view the simulation results by double-clicking the Vector Scope block in the testbench sink.

### **Modify Project**

#### **Modify the FPGA Model**

In the MATLAB toolstrip, on the **Project Shortcuts** tab, click **Open FPGA sample model** to open the FPGA model. In the model, two areas are highlighted green, which represents user code: one in the FPGA Algorithm Wrapper block and one in the Test Source Wrapper block.



- FPGA Algorithm Wrapper Double-click to open the model. The algorithm wrapper
  contains a green-highlighted subsystem named FPGA Algorithm. This block has two
  inputs and one output and is implemented as a multiplier. Replace this block with your
  own FPGA algorithm. Add inputs and outputs as required.
- Test Source Wrapper This block includes a test source and is intended to generate stimulus as an input to the FPGA algorithm. This block is implemented as a counter in this template. Modify the test source to your needs, or replace it with an alternative source block.

**Tip** When your FPGA model includes more than one IP, you must define each IP as a subsystem and connect the subsystems using a Stream Connector or Video Stream Connector block. For additional information, see "Considerations for Multiple IPs in FPGA Model" on page 1-58.

To enable consistent simulation behavior, click **Open FPGA frame model** in the **Project Shortcuts** tab and repeat this step. To simulate frame-based processing, you must have a DSP System Toolbox license.

#### **Modify the Processor Model**

In the MATLAB toolstrip, on the **Project Shortcuts** tab, click **Open Processor model**. The processor wrapper contains a blue highlighted subsystem representing the user code for the processor algorithm. Open the Processor Algorithm wrapper and replace the Processor Algorithm block with your desired algorithm.



#### **Modify the Register Channel**

The top model of a template also includes a register channel to communicate between the processor and the FPGA model. Use the register channel to configure the FPGA model, or to read and check status registers. The Register Channel block in the template includes one register. To add additional registers you must modify the register channel block parameters, the FPGA algorithm, and the processor algorithm:

- Add registers to the register channel Double-click the Register Channel block to open the block mask and add additional registers as needed. Adding registers creates additional ports on the Register Channel block. For additional information, see Register Channel.
- 2 Add ports to the processor model Navigate to the Processor Algorithm Wrapper model. To navigate to the processor model, click Open Processor model on the Project Shortcuts tab. Double-click Processor Algorithm Wrapper to modify it.
  - For write registers, add an output port to the module and add logic to drive a value to the added output port. For read registers, add an input port and logic to process the information returned from a read register. From the top model, wire the port to the Register Channel block.
- 3 Add ports to the FPGA model Navigate to the FPGA Algorithm Wrapper model. To navigate to the FPGA/Frame based processing model, click Open FPGA sample model on the Project Shortcuts tab. Double-click FPGA Algorithm Wrapper to modify it.
  - For write registers, add an input port to the module and logic to process the information returned from a read register. For read registers, add an output port and logic to drive a value to the added output port.
  - For equivalent behavior when using a Simulink sample-based variant, repeat this step for the sample-based processing model in the FPGA wrapper.
- **4** From the top model, wire the new port to the Register Channel block.

### **See Also**

#### **More About**

• "Use Template to Create SoC Model" on page 1-37

# **SDR Template**

The software defined radio (SDR) template provides a simulation model for an SoC reference design available from Communications Toolbox  $^{\text{\tiny TM}}$  Support Package for Xilinx Zynq $^{\text{\tiny 8}}$ -Based Radio. Use this template to simulate the full reference design and analyze the effects of internal and external connectivity on and SDR algorithm, such as memory behavior and Radio Frequency (RF) I/O behavior.

To get started with SoC Blockset model for designing an SDR system, follow the steps to "Create SoC Model Using SoC Blockset Template" on page 1-37.



#### **Required Products**

- Communications Toolbox
- SoC Blockset Support Package for Xilinx Devices

### **Template Structure**

This template models an SDR transceiver composed of AD9361 transmitter and receiver blocks. The transceiver connects an RF channel to the FPGA, which implements a receiver and a transmitter algorithm. The FPGA algorithm includes a Test Source block,

which generates a sinusoid signal and drives it to the transmitter. The FPGA algorithm also includes a Tx algorithm, implemented as simple pass-through wires, and an Rx algorithm, implemented as a gain block. A configuration register **srcSel** is modeled in the FPGA to select the source of data for the Rx algorithm. The processor writes the **srcSel** register to select either the test source from the FPGA or RF data from the transceiver. This register is modeled using the Register Channel block. Data from the FPGA algorithm is passed to the processor through a Memory Channel block.

Use this template as a guide and replace the Rx Algorithm and Tx Algorithm in the FPGA and the Processor Algorithm in the processor with your own functionality. The memory **Channel Type** parameter is set to AXI4-Stream to software via DMA, which models the direct memory access (DMA) data transfer through shared external memory.

The processor reads the computed data from the memory, and performs additional computing (implemented in the template as a pass-through wire). You can view the simulation results by double-clicking the Vector Scope block in the testbench sink.

### **Modify Project**

#### **Modify the FPGA Model**

In MATLAB, on the **Project Shortcuts** tab, click **Open FPGA sample model**. Then, open the FPGA Transceiver Algorithm Wrapper. Notice three areas highlighted in green. These areas represent user code and are located in the Receiver Algorithm block, in the Transmitter Algorithm block, and the Test Source block.



The FPGA model includes the following sections for you to modify (highlighted in green):

- Test Source block This block generates a 10-kHz sinusoid signal and drives it to the transmitter algorithm. Modify the test source to your needs or replace it with an alternative source block.
- Receiver Algorithm subsystem Inside the green-highlighted subsystem named Rx Algorithm, there is a block labeled Algorithm. The algorithm takes I/Q data as input and output with a valid signal. Replace this block with your own Rx algorithm.
- Transmitter Algorithm Inside the green-highlighted subsystem named Tx Algorithm, the algorithm has an input from the test source and two output signals: one to the RF channel and one to the FPGA. Replace this block with your own Tx algorithm.

To enable consistent simulation behavior, in the **Project Shortcuts** tab, click **Open FPGA frame model** and repeat this step.

#### **Modify the Processor Model**

In MATLAB, on the **Project Shortcuts** tab, click **Open processor model**. The subsystem highlighted in blue represents the user code for the processor algorithm. Open the Processor Algorithm wrapper and replace the internal Processor Algorithm block (also highlighted in blue) with your desired algorithm.





#### Modify the Register Channel

The top model of a template also includes a register channel to communicate between the processor and the FPGA model. Use the register channel to configure the FPGA model or to read and check status registers. The Register Channel block in the template includes one register. To add additional registers you must modify the register channel block parameters, the FPGA algorithm, and the processor algorithm:

- Add registers to the register channel Double-click the Register Channel block to open the block mask and add additional registers as needed. Adding registers creates additional ports on the Register Channel block. For additional information, see Register Channel.
- 2 Add ports to the processor model Navigate to the Processor Algorithm Wrapper model. To navigate to the processor model, click Open Processor model on the Project Shortcuts tab. Double-click Processor Algorithm Wrapper to modify it.
  - For write registers, add an output port to the module and add logic to drive a value to the added output port. For read registers, add an input port and logic to process the information returned from a read register. From the top model, wire the port to the Register Channel block.
- 3 Add ports to the FPGA model Navigate to the FPGA Algorithm Wrapper model. To navigate to the FPGA/Frame based processing model, click **Open FPGA sample**

 $oldsymbol{model}$  on the  $oldsymbol{Project}$  Shortcuts tab. Double-click FPGA Algorithm Wrapper to modify it.

For write registers, add an input port to the module and logic to process the information returned from a read register. For read registers, add an output port and logic to drive a value to the added output port.

For equivalent behavior when using a Simulink sample-based variant, repeat this step for the sample-based processing model in the FPGA wrapper.

**4** From the top model, wire the new port to the Register Channel block.

# **Considerations for Multiple IPs in FPGA Model**

When your FPGA model includes more than one block for which you'd like to generate HDL using HDL Coder, you must use a connector model to connect your blocks.

For additional information, see Stream Connector and Video Stream Connector blocks.

# Create an SoC Project Application

A system-on-chip (SoC) project developed using the SoC Blockset typically contains many diverse systems that make up a the complete application. These systems can include:

- Embedded processors with timer-driven and event-driven tasks.
- FPGAs with custom IP logic and timing.
- External memory systems with interaction to embedded processors and FPGAs.
- I/O device interaction, such as TCP/IP and UDP connections.

This example shows the steps to create an SoC application, using the features of the SoC Blockset, as a Simulink project. To begin, see "Project and Top-Level Model" on page 1-60.

**Note** This project is equivalent to the project automatically created by the "Stream from FPGA to Processor Template" on page 1-49. Templates are the recommended and preferred method for creating new projects. This example should be used for information purposes only.

### **See Also**

"Use Template to Create SoC Model" on page 1-37  $\mid$  "Stream from FPGA to Processor Template" on page 1-49

# **Project and Top-Level Model**

An SoC application model developed using the SoC Blocksetcombines multiple subsystems and reference models. Each subsystem and reference model maps to a particular feature of an SoC device. Organization of the models and shared configuration settings requires a Simulink project.

- Create a new SoC Blockset project named SampleSoCApplication. Creating a new project automatically creates a new project folder with the same name. For more information on creating projects, see "Create a New Project From a Folder" (Simulink).
- 2 Open a new Simulink model. Save the model as soc\_hwsw\_top.slx into the project folder.
- 3 In MATLAB, on the Project tab, click Add Files and add the soc\_hwsw\_top.slx model file to the project.
- 4 In Simulink, configure the soc\_hwsw\_top.slx model to as an SoC application. On the **Apps** tab, under **Setup to Run on Hardware**, click**System on Chip (SoC)**.
- In the **System on Chip (SoC)** pop-up window, select **Hardware Board > Xilinx Zynq ZC706 evaluation kit**. Click **Finish**.

**Note** You can optionally choose any of the available hardware boards based to suit your system requirements.

- 6 On the **System on Chip** tab, click **Hardware Settings**. On the **Configuration Parameters** dialog box, in the **Solver** tab, set **Solver selection > Type** to Variable-step. Click **OK**.
- 7 Create three box areas and label them as FPGA, Memory, and Processor. For more information on creating box areas, see "Box and Label Areas of a Model" (Simulink). In the following sections, these areas are populated for various aspects of your SoC application.



**8** Create a new MATLAB function to initialize variables used throughout the project.

```
function soc_hwsw_init
% Initialize the model wide variables and set them in base workspace.

SourceSTime = 1e-7;

FrameSize = 1000;
ProcSTime = SourceSTime*FrameSize;
FPGASTime = SourceSTime;
FPGAFrameSize = 1;

assignin('base','ProcSTime',ProcSTime);
assignin('base','FPGASTime',FPGASTime);
assignin('base','SourceSTime',SourceSTime);
assignin('base','FPGAFrameSize',FPGAFrameSize);
assignin('base','FrameSize',FrameSize);
end
```

In the project folder, save the file as soc\_hwsw\_init.m in a new subfolder, utilities and add the file to project.

# **See Also**

"Software and Task Management on Processor" on page 1-63  $\,$ 

#### **More About**

- "Create a New Project From a Folder" (Simulink)
- "Box and Label Areas of a Model" (Simulink)

# Software and Task Management on Processor

The processor system in this SoC application reads data from the external memory following a write from the FPGA to that memory. Since FPGA writes and interaction with external memory are asynchronous, the processor uses an event-driven task to read from memory. The software also manages a register on the FPGA that specifies a multiplication factor to be used in the FPGA algorithm.

#### **Processor Model**

- Open a new Simulink model. Save the model as soc\_hwsw\_proc.slx into a new subfolder, named processor, in the project folder. Add the soc\_hwsw\_proc.slx model to the project.
- In Simulink, configure the soc\_hwsw\_top.slx model to as an SoC application. On the **Apps** tab, under **Setup to Run on Hardware**, click**System on Chip (SoC)**.
- In the **System on Chip (SoC)** pop-up window, select **Hardware Board > Xilinx Zyng ZC706 evaluation kit**. Click **Finish**.

**Note** The processor model must use the same hardware board and solver configuration parameter settings as the top level model.

In the model, using a Function-Call Subsystem block, Asynchronous Task Specification block, Inport block, and Outport blocks, create the following system.



- 5 In the dataTask block dialog mask, check **Signal Attributes** > **Output function** call to expose a function call port on the outside model.
- 6 In the Asynchronous Task Specification block dialog mask, set **Task priority** to 50.

**Note** The task priority of the Asynchronous Task Specification block must match the priority of task in the Task Manager block driving this task.

#### **Task Processing**

The Processor Algorithm Wrapper subsystem reads data from the external memory only after each write to the external memory by the FPGA.

- 1 Open the Processor Algorithm Wrapper block.
- 2 Using a Stream Read block, Constant block, Data Type Conversion block, and Subsystem blocks, create the following model.

f()



Add your Processor Algorithm in above subsystem



- 3 Open the Stream Read block dialog mask. Set **Number of buffers** to 6.
- 4 Open the Data Type Conversion block dialog mask and set **Output data type** to uint16.
- 5 The Processor Algorithm subsystem serves as a base to develop your own processing algorithm.
- 6 Open the Register Channel Write subsystem block.
- **7** Add a Register Write block to create the following model.



Open the Register Write block dialog mask. Set Device name to /dev/ mwsinewavegenerator\_ip0 and Offset address to hex2dec('100').

### **Top Model**

- In the project folder, open the model soc hwsw top.slx.
- 2 Add a Subsystem block into the Processor area and label the block Processor.
- In the Processor subsystem, using the Task Manager block and Model block, create the following system.



- 4 Open the Model block dialog mask and set **Model name** to soc hwsw proc.slx.
- Open the Task Manager block dialog mask. Set the task **Name** to dataReadTask and set the **Priority** to 50. In the **Simulation** tab, set the **Mean**, **Min**, and **Max** to 8e-05. Click **OK**.

### See Also

Task Manager

### **More About**

- "What is Task Execution?" on page 1-2
- "Event-Driven Tasks" on page 1-4
- "Task Duration" on page 1-16

# **User Logic on FPGA**

In this SoC project example, the FPGA generates test data and process it in FPGA algorithm before passing it to processor using shared memory.

### **Sample Based Model**

Open a new Simulink model. Save the model assoc\_hwsw\_fpga\_sample.slx into the subfolder, named referencedmodels, in the project folder.

2

3 On the Modelling tab, click Model Settings. On the Configuration parameters window, in the Hardware Implementation panel, set Hardware board to None and set Device vendor to ASIC/FPGA. In the Solver panel, set Solver selection > Type to Fixed-step. Click OK to apply the changes and close the configuration parameters.

**Note** SoC Blockset requires that the FPGA reference models specify the intended deployment hardware, in this case an FPGA.

4 In the new model, using Stream Connector block, SoC Bus Selector block, SoC Bus Creator block, and Subsystem blocks, create the following system.



**Note** The signals for rdCtrlIn and rdCtrlOut must use bus signal types set to StreamS2MBusObj and StreamM2SBusObj, respectively.

**Tip** When your FPGA model includes more than one IP, you must define each IP as a subsystem and connect the subsystems using a Stream Connector or Video Stream Connector block. For additional information, see "Considerations for Multiple IPs in FPGA Model" on page 1-58.

In the SoC Bus Creator block dialog mask, set Control type to Valid.

5 The Test Source subsystem simulates a free-running counter. Open the Test Source subsystem and create the following system.



Add your 'Test Source' in the above subsystem



**Note** The sources, All data is valid and No-Op Tlast, must produce a signal with boolean data type.



Replace 'Test data (counter) source with your test source



6 The FPGA Algorithm subsystem simulates the multiplication of streamed data. Open the FPGA Algorithm subsystem and using an Enabled Subsystem, Logical Operator, and Data Type Conversion blocks, create the following system.



Add your FPGA algorithm in above subsystem



### **Top Model**

- 1 In the project folder, open the model soc\_hwsw\_top.slx.
- 2 Add a Subsystem block into the FPGA area and label the block FPGA.
- 3 In the FPGA subsystem, using the Model block, create the following system.



4 Open the Model block dialog mask and set **Model name** to soc\_hwsw\_fpga\_sample.slx.

The "Stream from FPGA to Processor Template" on page 1-49, the FPGA subsystem uses a model variant to select between the sample based model developed in this section and a frame based model. The frame based model allows faster simulations but does not support code generation.

### See Also

SoC Bus Creator | SoC Bus Selector | Stream Connector

### **More About**

• "AXI4-Stream Interface" on page 1-25

### **Memory and Register Channel Connections**

The memory channel models the data transfer from FPGA to processor using shared external memory. The register channel models the control of FPGA logic from processor. You can both configure the FPGA logic and read the status of FPGA logic from processor. The following sections show how to connect create these channel connections.

### **Memory Channel Connection**

- 1 Open the soc\_hwsw\_top.slx model.
- Add a Memory Channel block and a Memory Controller block to the Memory area. Together, these blocks model the memory connection through DDR between the processor and FPGA sides of your application.



- Open the Memory Controller block dialog mask. Set **Number of masters** to 2. In the **Advanced** tab, the Memory Controller automatically inherits parameters from the **Hardware board** specified in the model configurations.
- 4 Connect the pair of Memory Controller burst ports, burstReq and burstDone, to the read and write burst request ports of the Memory Channel block.
- 5 In the model, open the Memory Channel block dialog mask. Set Channel type to AXI4-Stream to Software via DMA. Set Buffersize (bytes) to FrameSize\*4 and Number of buffers to 6. Click OK.

### **Register Channel Connection**

Add a Register Channel block to the model and connect the block to the Processor and FPGA subsystems as shown in the following image.



2 Open the Register Channel block dialog mask. Add a new register with these properties.

| Register  | Direction | Data type | Dimension |
|-----------|-----------|-----------|-----------|
| configReg | Write     | uint8     | 1         |

Set **Register write sample time** to FPGASSTime. Click OK. This sample time is set in the file  $soc_hwsw_init.m$ .

### See Also

Memory Controller | Memory Channel | Register Channel

### **More About**

• "Memory and Register Channel Connections" on page 1-72

# **Simulation and Analysis**

This set of steps runs the soc\_hwsw\_top.slx model created in the previous steps. A visual of the processor output data shows the complete SoC application.

- 1 In the project folder, open the model soc hwsw top.slx.
- 2 Using a Scope block and Rate Transition block, update the model as shown in this diagram.



**3** Run the model and open the Vector Scope.

The display in the  ${\tt Vector}\,$   ${\tt Scope}$  shows the counter output.

### See Also

"Use Template to Create SoC Model" on page 1-37 | "Stream from FPGA to Processor Template" on page 1-49  $\,$ 

### **Custom Hardware Board Configuration**

A custom hardware board is a hardware board that not explicitly supported as a default selection in SoC Blockset. To create an SoC project to simulate a custom hardware board, configure a Simulink project as follows:

- 1 Create or open an existing SoC project. For more information on creating SoC projects, see "Use Template to Create SoC Model" on page 1-37.
- In the top level model, open the Simulink configuration parameters dialog. In the Hardware Implementation panel, set Hardware board to Custom Hardware Board.
- 3 In the Hardware Implementation panel, open the Target hardware resources > Processor group. Set Number of cores to match the number of cores available on your SoC processor. The cores available in your processor can be found from the SoC manufacturer's data sheet.
- 4 Open the Target hardware resources > FPGA design (mem controllers) group and set the "FPGA design (mem controllers)" configuration parameters according to your SoC specifications. For information on deriving "FPGA design (mem controllers)" parameters, see the Memory Controller block which shares these parameters.
- 5 Open the **Target hardware resources** > **FPGA design (mem channel)** group and set the "FPGA design (mem channels)" configuration parameters according to your SoC specifications. For information on deriving "FPGA design (mem channels)" parameters, see the Memory Channel block which shares these parameters.

**Note** The Custom hardware board selection only supports simulation. For code generation, use one of the provided SoC Blockset hardware board selections.

### **See Also**

"Hardware Implementation Pane"

# **Build Error for Rapid Accelerator Mode**

SoC Blockset does not support "Rapid Accelerator Mode" (Simulink) simulation of models. Attempting to use SoC Blockset blocks and features in model running rapid accelerator mode results in undefined behavior.

In SoC Blockset models, set the simulation mode to normal mode, accelerator mode, or external mode.

### See Also

#### **More About**

"Rapid Accelerator Mode" (Simulink)

# **Simulate SoC Applications**

- "Task Overruns and Countermeasures" on page 2-2
- "Task Execution Playback using Recorded Data" on page 2-7
- "Task Priority and Preemption" on page 2-8
- "Multicore Execution and Core Visualization" on page 2-12
- "Recording Tasks for Use in Simulation" on page 2-16
- "Task Visualization in Simulation Data Inspector" on page 2-18
- "Simulation Performance Plots" on page 2-20
- "Simulation Diagnostics" on page 2-34
- "External Memory Channel Protocols" on page 2-39
- "Record Data from Hardware I/O Devices" on page 2-42
- "Use Memory and I/O Device Data in Processor Simulation" on page 2-43

### **Task Overruns and Countermeasures**

With finite processing resources available in a system, an execution instance of a task might not be able to complete before the start of the next task instance. This task overrun results in the start of the next instance of the task execution to be delayed. As a result, the next task must catch-up to avoid another overrun. This diagram shows a simplified execution of two tasks: a high-priority event-driven task and a low priority timer-driven task.



Due to the long execution time of the event-driven task, the first execution instance of the timer-driven task overruns into the start of the next execution instance. This overrun puts the second execution instance into catch-up mode.

When tasks overrun repeatedly, an execution backlog can develop in the application, potentially breaking the system. These sections discuss typical countermeasures to either reduce the chance of task overruns or handle situations when tasks overrun, preventing an execution backlog.

#### **Reduction of Task Execution Interval**

For timer-driven tasks, reduce the chance of overruns by providing the task with more execution time. Increase available execution time by decreasing the task rate, which is equivalent to increasing the time between task execution instances. This extra time provides each task execution instance a better chance of running to completion, even in the presence of other tasks. The rate of a timer-driven task can be adjusted in the Task Manager block by setting the **Period** parameter.

Reduction of the task execution interval cannot be guaranteed in all cases. Some of these cases include:

- For event-driven tasks, multiple events can occur at the same time, depending on the
  priority of the event-driven task. This case forces other tasks to overrun due to lack of
  resources.
- Real-time requirements where a task, timer or event driven, must respond to the latest
  event trigger signal and new data regardless of whether previous task instances
  completed. This case fixes the task execution interval to a value determined by the
  design requirements.

In these cases, distributing tasks across multiple processor cores or allowing tasks to drop can be advantageous depending on the design requirements.

### **Distribution of Tasks Across Multiple Processor Cores**

Most modern embedded processors provide multiple cores where tasks can be executed. By distributing tasks across these multiple processor cores, tasks can run simultaneously without directly competing for processing resources and reducing the chance of task overruns. In SoC Blockset, a task can be set to run on a specific processor core in the Task Manager block by setting the **Core** parameter to the core number. For more information on the selection, execution, and visualization of tasks on multiple cores, see "Multicore Execution and Core Visualization" on page 2-12.

### **Dropping Overrunning Tasks**

In some designs, a task must execute when the task trigger signal occurs or with the latest state of the system. If a task has been triggered and a new task trigger occurs, the new instance can be removed or *dropped*. After dropping the execution instance of the task that overran the next execution instance starts when the event trigger signal arrives.

To drop tasks when an overrun occurs, in the Task Manager block, enable the **Drop task that overrun** parameter.

#### **Task Drops in Simulation**

This example shows how to configure a task in the Task Manager block to drop when a task overrun occurs during simulation.

#### **Task Overrun Without Task Drops**

This model simulates a software application running on an ARM processor. A Task Manager block schedules the execution of the Timer Driven Subsystem, inside the Software Application Model Reference block. A Random Number block simulates a data source that the timer-driven task samples.



In this model, the task duration of 0.6 seconds exceeds the task period of 0.5 seconds causing the task to overrun. Click the Run button to build and run the model. When the model finishes running, the Simulation Data Inspector (SDI) display shows the task execution timing.



Inspecting the execution timing of the tasks shows that the start of each following task instance is delayed from the expected  $\theta$ . 5-second interval by the overrun of the previous task.

#### **Task Overrun With Task Drops**

Using the same previously shown model, rather than overrunning the timer-driven task, the task drops so the next task instance starts at the 0.5-second interval. Open the Task Manager block dialog mask, and select  $\bf Drop\ tasks\ that\ overrun$ . Run the model again. Open the SDI to view the task execution and dropped task instances.



# **See Also**

Task Manager

### **More About**

• "Multicore Execution and Core Visualization" on page 2-12

### Task Execution Playback using Recorded Data

The Task Manager block can replay the execution timing of a task recorded from either a previous simulation of that task or from the execution of a task on a processor in an SoC device. To replay a task timing data file, use the following procedure:

- 1 In a Simulink model, open the Task Duration block dialog box.
- **2** Select a task from the list of available tasks.
- 3 In the Simulation tab, select Play back recorded task diagnostics file.
- 4 Click **Browse** to select a *taskname*.csv file from a previous task simulation.

While using the data file for the task timing information, the Task Manager still manages individual tasks according the scheduling of the system and can be preempted by other higher priority tasks in the model. For more information on task priority and preemption, see "Task Priority and Preemption" on page 2-8.

### See Also

Task Manager

### **Related Examples**

• "Task Execution" on page 5-77

### **More About**

• "Task Duration" on page 1-16

# **Task Priority and Preemption**

Task priority informs the operating system of the importance of the task and the order in which a group of waiting tasks needs to execute. By setting the priorities of the tasks in the Task Manager block, tasks that need to react to critical or time-sensitive events can preempt lower priority and background tasks.

Tasks listed in the Task Manager block execute in a *rate monotonic* order. Rate-monotonic order requires the task with the highest static priority in the preempted state to immediately preempt all other tasks and enter the running state. Timer-driven tasks with shorter periods get higher static priorities. If two tasks with equal priority in the preempted state, when no other running task exists, then tasks execute in a first-in, first-out (FIFO) order.

Each event-driven task listed in the Task Manager block can be set with an explicit execution priority. Timer-driven tasks inherit their priority from the base rate task priority of the model. In the configuration parameters, the base rate task priority is set by the Hardware Implementation > Hardware board settings > Operating system/scheduler > Base rate task priority parameter. The following example shows the interaction between a pair of competing tasks.

### Preemption of Low Priority Task by High Priority Task

This example shows how the task manager changes the state of two tasks, preempting the lower priority task to allow the high priority task to run.

### Task Manager with High and Low Priority Tasks

The following model simulates a software application with a high and low priority task. A Task Manager block schedules the execution of the task subsystems inside the Software Application Model Reference block.



The low priority, timer driven, task is scheduled to run every 0.5 seconds with a duration of 0.2 seconds. The high priority, event driven, task is scheduled to run when a new UDP data packet arrives, which occurs every 1.1 seconds and requires a task duration of 0.5 seconds. As a result of these timing conditions, the low priority task gets preempted to allow the high priority task to run.

#### **Simulation Showing Task Preemption**

Click the Run button to build and run the model. When the model finishes running, open the SDI display to see the results of the simulation. Select the HighPriority and LowPriority task waveforms to see the task preemption.



Inspecting the SDI display at time 1.0, the low priority task starts executing until time 1.1, getting preempted by high priority task. The low priority task then runs to completion

at 1.7 seconds, overrunning the next instance of the low priority task that should have started at 1.5 seconds.

### See Also

Task Manager

#### **More About**

- "What is Task Execution?" on page 1-2
- "Task Overruns and Countermeasures" on page 2-2

### **Multicore Execution and Core Visualization**

SoC Blockset enables simulation of task executions as they behave on a multicore processor. In multicore simulations, tasks can run simultaneously when assigned to different processor cores. Additionally, assigning lower-priority tasks to unique cores prevents these tasks from getting preempted, giving greater confidence to the final application.

### **Specify the Core for a Task**

To set the processor core on which a task executes, open the Task Manager block dialog mask. Select a **Task** from the available tasks. In the task properties, set **Core** to a nonnegative integer value. During simulation, task instances execute on the specified core, subject to the preemption by other tasks executing on the same core. For more information on task preemption, see "Task Priority and Preemption" on page 2-8.

### **Core Visualization in Simulation Data Inspector**

SoC Blockset provides a view of the processor cores on the Simulation Data Inspector. This diagram shows the visualization of the core activity relative to the task state.



In the Simulation Data Inspector, the signal corei shows the current task executing on that core. When the core activity displays as idle, then that core has all tasks in the waiting state, and the kernel can use that core for background tasks that are not part of the main application.

**Note** If a task instance does not run to completion during the simulation time, the related core status over that instance appears empty in the SDI display.

#### **Multi-Core Task Execution**

This example shows the simulation of multiple tasks, managed by the Task Manager block, execute on multiple cores with display the core activity shown in the Simulation Data Inspector.

This model simulates a software application, running on an ARM processor, with 3 timer-driven tasks. A Task Manager block schedules the execution of the tasks, inside the Software Application Model Reference block. Tasks 1, with a period of 0.01 seconds, executes on Core 0. Tasks 2 and 3, with periods of 0.02 and 0.03 seconds, respectively, execute on Core 1.



ARM Processor

Click the Run button to build and run the model. When the model finishes running, open the SDI display to see the results of the simulation. Select the **Core 0** and **Core 1** to view the core execution status.



As shown in the SDI display, the core executes either the running task or moves to an idle state, to perform background kernel tasks. Additionally, as two cores are used in this application, high-priority, Task1 executes at the start of each trigger event. Similarly, Task2 and Task3 do not get preempted by Task1. As a result, the application makes better use of the available processor resources.

### See Also

Simulation Data Inspector | Task Manager

#### **More About**

• "Task Priority and Preemption" on page 2-8

### **Recording Tasks for Use in Simulation**

Each time a model containing a Task Manager block runs in simulation or on an embedded processor with external mode, Simulink records task execution data and statistics as a set of files. A diagnostics folder, with name <code>modelname\_diagnostics</code>, contains two subfolders, <code>sim</code> and hw, for the data from simulations and recorded from hardware, respectively. Each run generates a unique folder, inside either the <code>sim</code> or hw folders, labelled by the date and time of the run. The folder name uses a time-date format, <code>YYYY\_MM\_DD\_hh\_mm\_ss</code>, representing the year, month, day, hour, minute, and second, respectively.

**Note** To enable external mode in an SoC model, use the **SoC Builder** app.

Each run generates a set of metadata, statistics, and execution recording files, including:

- TaskInfo.mat This file contains task information, including the task names and types, used internally by the SoC Blockset.
- metadata.csv This file contains the derived mean and standard deviation for all tasks recorded in the profile.log data file. The metadata.csv file can be used directly in the Task Manager block to set task duration statistics. For more information on setting task duration, see "Task Duration" on page 1-16.
- TaskName.csv This file contains the recorded task execution data as a commaseparated variable list. The first column contains the start time of each task instance. The second column contains the task durations for each task instance. If a task is dropped, lost, or corrupted, the start time and duration of that task execution instance are both replaced by -1. For more information on using recorded task execution timing in simulation, see "Task Execution Playback using Recorded Data" on page 2-7.

#### Note

- Tasks recorded from an embedded processor only start capturing task execution after successful connection of external mode. The lost start-up in task execution recordings from hardware should be considered when comparing timing results to recordings from simulation.
- When executing on an embedded processor, task execution recordings times will
  continue to run until the completion of all task instances scheduled in the Task
  Manager prior to the stop time of the model.

### **See Also**

Task Manager

### **More About**

- "Task Duration" on page 1-16
- "Task Execution Playback using Recorded Data" on page 2-7
- "Profile Task Execution on Processor" on page 4-7

# Task Visualization in Simulation Data Inspector

The Simulation Data Inspector (SDI) display provides a direct view into the execution timing, the task state, and the execution of tasks in simulation and profiled from generated code running on hardware. Each model run, in simulation or on hardware using external mode, creates a separate **Run** for the task execution timing and data. This image shows the SDI display with a **Run** captured from an SoC Blockset model.



Each **Run** contains these signal types:

• taskname - The execution instance state for the task, with name taskname, defined in the Task Manager block. For more information on task execution states, see "What is Task Execution?" on page 1-2.

**Note** If a task instance does not run to completion during the simulation time, the final task execution instance does not render in the SDI display.

• taskname\_drop - An impulse indicating the scheduler dropped an execution instance of task, taskname\_drop. For more information on task drops, see "Task Overruns and Countermeasures" on page 2-2.

• Core: *n* - Execution activity on core *n* of the simulated processor. For more information on multicore execution and visualization, see "Multicore Execution and Core Visualization" on page 2-12.

**Note** If a task instance does not run to completion during the simulation time, the related core status over that instance does not render in the SDI display.

### See Also

Simulation Data Inspector | Task Manager

#### **More About**

- "What is Task Execution?" on page 1-2
- "Task Overruns and Countermeasures" on page 2-2
- "Multicore Execution and Core Visualization" on page 2-12

### **Simulation Performance Plots**

SoC Blockset enables post-simulation analysis of memory diagnostic data. These plots provide high-level performance diagnostics of the memory system of the model. These plots are calculated measurements from a simulation of your model. It considers the data type, sample time, and clock frequency to calculate the bandwidth of your memory model and considers the number of bursts executed per memory port.

To enable signal logging in simulation, select **Hardware Implementation** on the Configuration Parameters dialog box. Under **Hardware Board Settings > Target Hardware Resources > FPGA design (debug)**, select the desired **Memory channel diagnostic level**.

This figure shows the datapath from one FPGA algorithm to another FPGA algorithm through a memory channel.



You can view channel latency plots for the datapath (represented by A, B, C, and D in the image) from the Memory Channel block mask. You can view memory bandwidth, burst count, and control-latency measurements (represented by 1, 2, 3, and 4 in the image) from the Memory Controller block mask.

The datapath from an FPGA algorithm to a processor is served through a DMA driver and a task processor and is illustrated in this image.



### **Memory Channel Latency Plots**

Memory Channel latency information is available post simulation per channel. After simulating your model, open the Memory Channel block mask. On the **Performance** tab, click **Launch performance plots**. This action opens a new window with several control options to display these different latencies:

- **Buffer write complete** This option shows the time it takes between issuing a write request to when the buffer is fully written. It is the path between A and B in the figure.
- **Buffer read complete** This option shows the time it takes between issuing a read request to when the buffer is read and is available again for writing. It is the path between C and D in the figure. This option is only available if the reader is an FPGA algorithm (not a processor algorithm). If the reader is a processor algorithm, this time shows as zero.
- **Buffer task execution complete** This option shows the time it takes between issuing a read request to when the buffer is read and is available again for writing. It is the path between C and D in the figure. This option is only available if the reader is a processor algorithm (not an FPGA algorithm). If the reader is an FPGA algorithm, this time shows as zero.

The **Buffer task execution complete** shows the time it takes for these events to occur:

- **1** The write buffer is full.
- **2** The channel issued an interrupt request (IRQ) to the processor.
- **3** An interrupt service routine (ISR) is executed.
- 4 A task is scheduled.
- **5** The task started executing.
- **6** The task read data.
- **7** The task optionally processed the data.
- **8** The task sends a **done** signal back to the channel.

This following figure shows the latency path for a task execution to complete, as a red arrow from C to D.



- Averaging Window (s) Specify a time, in seconds, for the averaging window width.
  The plot is graphed as a moving average, using a time window with the width
  specified. You can also specify min, max, or auto.
  - min Use this value to see data without any averaging. The total latency graph is aligned with the **Instantaneous Total Latency** marks.
  - max Use this value to see the overall average for the entire simulation.
  - auto Use this value to see averaging over the number of buffers in your channel.

• Instantaneous Total Latency – This shows discrete total latency measurements per buffer.

If you add **Buffer write complete** to **Buffer read complete** or **Buffer task execution complete**, the plot displays the full latency from writer to reader. This image shows the total latency plot for the "Streaming Data from Hardware to Software" on page 5-39 example.



Note that the latencies are showing over an averaging window of one second. The instantaneous total latency shows a peak in latency as 76.8267 ms. Use this information to verify the model against the requirements.

# **Memory Controller Latency Plots**

Memory Controller latency information is available post simulation. After simulating your model, open the Memory Controller block mask. On the **Performance** tab, click **Launch performance plots**. This action opens a new window with several control options to display performance metrics.

This figure shows the datapath from one FPGA algorithm to another FPGA algorithm through a memory channel.



In the **Latencies** tab, select the master for which you want to graph latencies. Choose from any of these options:

- Burst request to first transfer complete This option shows the time it takes from the moment the Memory Channel block issues a burst-write request to the first transfer of data. This latency accounts for arbitration or interconnect delays. It is the path between 1 and 2 in the figure.
- **Burst execution latency** This option shows the time it takes from the first transfer of data to when a burst is written to memory. It is the path between 2 and 3 in the figure.

- Burst last transfer to complete latency This option shows the time it takes from
  the moment a burst completes to when the Memory Controller block issues a burst done signal to the Memory Channel block. It is the path between 3 and 4 in the figure.
- Averaging Window (s) Specify a time, in seconds, for the averaging window width.
  The plot is graphed as a moving average, using a time window with the width
  specified. You can also specify min, max, or auto.
  - min Use this value to see data without any averaging. The total latency graph is aligned with the **Instantaneous Total Latency** marks.
  - max Use this value to see the overall average for the entire simulation.
  - auto Use this value to see averaging over 1% of the bursts during the simulation.
- **Instantaneous Total Latency** This option shows discrete total latency measurements per burst.

Click **Create Plot** to see the latency, for the selected masters over the duration of the simulation time. This image shows the total latency for Master 2 in the "Analyze Memory Bandwidth Using Traffic Generators" on page 5-53 example.



**Note** Memory controller latency plots are not available when the master is a processor.



# **Memory Bandwidth Plots**

In the **Bandwidth** tab, select the masters for which you want to graph bandwidth. Click **Create Plot** to see the bandwidth, in megabytes per second, for the selected masters over the duration of the simulation time. This image shows the bandwidth for the "Analyze Memory Bandwidth Using Traffic Generators" on page 5-53 example.



**Note** Bandwidth information is not displayed when a master is a processor.

# **Memory Burst Plots**

In the **Bursts** tab, select the masters for which you want to graph bursts. Click **Create Plot** to see the number of bursts executed for the selected master over the duration of the simulation time. This image shows the burst count for the "Analyze Memory Bandwidth Using Traffic Generators" on page 5-53 example.



**Note** Bandwidth information is not displayed when a master is a processor.

# See Also

Memory Controller | Memory Channel

#### **More About**

• "Memory Performance Information from FPGA Execution" on page 4-9

# **Simulation Diagnostics**

SoC Blockset enables simulation and evaluation of memory transactions in Simulink without the need to deploy a model to an SoC device. Use this diagnostic information to analyze the performance of your models, and adjust as needed to meet the desired system performance requirements. The simulation generates two types of visualization of the memory traffic:

- "Simulation Performance Plots" on page 2-20 Provides high level performance
  diagnostics of the model's memory system. Memory bandwidth, burst counts, and
  transaction latencies are calculated from a simulation of your model. You can view this
  information for each memory master in your model, or an overall view from the
  memory controller.
- "Buffer and Burst Waveforms" on page 2-34 Provides burst transaction debug information from simulation, including the use of buffer regions.

You can also capture actual bandwidth, number of bursts, and latency measurements from the design running on the FPGA, and view information about individual burst transactions. This information is captured by including an AXI interconnect monitor IP in the FPGA design, and querying the data over a JTAG AXI master connection from the host. See "Memory Performance Information from FPGA Execution" on page 4-9.

#### **Buffer and Burst Waveforms**

SoC Blockset enables logging simulation signals, and visualizing the logged signals using the *Logic Analyzer*. To enable signal logging, Set **Memory diagnostics level** to Basic diagnostic signals in the configuration parameters of the model, under **Hardware Implementation** > **Target hardware resources** > **FPGA design (debug)**.

After simulating your model, locate the **Logic Analyzer** at the top of your Simulink window.



The **Logic Analyzer** tool provides visualization of signal waveforms to show timing of various events of the memory model.

The **Logic Analyzer** displays signals from the Memory Controller and from the Memory Channel blocks.

#### Burst Waveforms

Waveforms from the memory controller include information for bursts from the masters in the system. The waveforms are color coded to differentiate the different masters. These waveforms give insight into the sequencing of each of the masters through the shared memory. For each master, view the following signals:

- BURST\_EXECUTION\_EVENT: State of the current burst request. Valid states are: none (idle), request, executing, done. For more information about the memory controller state, see Memory Controller.
- ReqID: Identifier of the current burst request. An incrementing number that is unique throughout simulation.
- burstTransfersCompleted: A running count of transferred bursts. If no bursts are dropped within the memory channel, the count of transferred bursts matches RegID. If bursts are dropped, RegID becomes larger than this count.
- BytesTransferred: A running count of transferred bytes.

The following figure shows the signals after simulating "Analyze Memory Bandwidth Using Traffic Generators" on page 5-53.



The waveforms include burst information for the four masters, displayed in different colors. This information correlates to the "Memory Controller Latency Plots" on page 2-25.

#### Buffer Waveforms

Waveforms from the memory channel include information for buffer read and write transactions in the channel. Each memory region is divided into several buffers specified by the **Number of buffers** parameter of the Memory Channel block. The writer fills the buffers, and the reader empties them. These waveforms give insight into the sequencing of the writer and reader for a given region. The buffer waveforms include the following signals:

- REGION\_BUFFER\_EVENT: State of the current buffer request. Valid states are: none
  (idle), request, executing, done. For more information about the state of the
  memory channel, see Memory Channel.
- BufReqID: Identifier of the current buffer request. An incrementing number that is unique throughout simulation.
- BufferAddress: Starting address offset of the current buffer. The buffer address repeats as the simulation cycles through the buffers, reflecting the address boundaries of the buffers.
- BufGntCurrentBuf: The currently active buffer specified from 1 to the number of buffers in the channel. BufGntCurrentBuf points to the buffer being written to (on the writer side), or the buffer being read from (on the reader side).
- BufRelCurrentBuf: The buffer currently released by the reader or writer specified from 1 to the number of buffers in the channel. On the reader side, when a buffer is released it is available to the writer for writing. On the writer side, when a buffer is released it is available to the reader for reading.
- BufAvail: The number of buffers currently available to the reader for reading. This value is identical on the reader and the writer side.
- BufTransfersCompleted: A running count of transferred buffers. If no buffers are dropped within the memory region, the count of transferred buffers matches BufReqID. If buffers are dropped, BufReqID is larger than this count.
- icFIF0Entries: Number of bursts written to the interconnect FIFO.
- icFIFODroppedCount: Number of bursts dropped from the interconnect FIFO.
- The following figure shows the buffer signals after simulating "Histogram Equalization Using Video Frame Buffer" on page 5-25.



You can relate the memory model operation with the protocol interface to understand the performance of your model. The following figure shows how to relate the memory model operation with the protocol interface.

- Backpressure from rdReady
- 2. Reader cannot finish buffer 2.
- Writer's FIFO begins filling up.
- 4. Writer is blocked and must assert back-pressure upstream.
- Reader gets to finish buffer.
   Everyone starts moving again.



#### See Also

Logic Analyzer | Memory Controller | Memory Channel

#### **More About**

• "Simulation Performance Plots" on page 2-20

# **External Memory Channel Protocols**

The signal interfaces added to the channel model for the writer and reader are protocols that the algorithms use to communicate with the channel. Protocols do not change the core of the external memory channel model, which operates on burst transactions. They control only how the data gets in or out of those channels.

For FPGA or ASIC IPs, typical protocols include streaming data, streaming video data, and addressable data transfers. For software, typical protocols presented to an algorithm include simple data buffer, with details about interrupts, buffer management, and task scheduling left to the underlying OS.

Configure the Memory Channel block to support various protocols.

#### AXI4 Stream to Software via DMA

The AXI4-Stream Software configuration provides a software streaming protocol. Choose this configuration when a processor acts as a reader/writer to the memory. This protocol includes a trigger configuration, which the Task Manager block receives and reads. The trigger signals that the memory buffer is ready for writing or reading. For more information about AXI4-stream protocol, see "AXI4-Stream Interface" on page 1-25.

#### **AXI4 Stream FIFO**

The AXI4-Stream configuration provides a simple data valid and ready protocol for data streaming. You can generate a fully compliant AXI4-Stream interface from this protocol using HDL Coder.

For data stream channels, memory addressing is automatic. The channel is responsible for converting the stream to buffer addresses as a DMA core would. The relationship of the stream to the managed buffers in the external memory is through an 'end of buffer' signal, known as tlast for AXI4-Stream. For more information about AXI4-stream protocol, see "AXI4-Stream Interface" on page 1-25.

## **AXI4 Stream Video FIFO**

The AXI4-Stream Video FIFO configuration provides a data valid and ready protocol similar to the AXI4 Stream FIFO. This protocol also has additional signaling to mark the start or the end of a video line and start or end of a video frame. This protocol is

compatible with the HDMI Rx and HDMI Tx blocks, available with the SoC Blockset Support Package for Xilinx Devices. You can generate a fully compliant AXI-Stream video streaming interface from this protocol using HDL Coder. For information about the HDMI blocks, see documentation for SoC Blockset support packages.

For streaming video data channels, memory addressing is automatic. The channel is responsible for converting the stream to buffer addresses as a DMA core would. The stream relates to the managed buffers in the external memory through the pixel control bus signals, which demarcate lines and frames. For more information, see "AXI4-Stream Video Interface" on page 1-32.

#### AXI4 Stream Video Frame Buffer

The AXI4-Stream Video Frame Buffer configuration provides The same signaling as the AXI4 Stream Video FIFO, with additional control signals for frame-buffer synchronization. This protocol is compatible with the HDMI Rx and HDMI Tx blocks, available with the SoC Blockset Support Package for Xilinx Devices. You can generate a fully compliant AXI-Stream video streaming interface from this protocol using HDL Coder. For information about the HDMI blocks, see documentation for SoC Blockset support packages.

For streaming video data channels, memory addressing is automatic. The channel is responsible for converting the stream to buffer addresses as a DMA core would. The stream's relationship to the managed buffers in the external memory is through the pixel control bus signals, which demarcate lines and frames.

#### **AXI4 Random Access**

The AXI4 configuration provides a simple, direct interface to the memory interconnect. Unlike the previous two streaming protocols, this protocol allows the algorithm to act as a memory master by providing the addresses and managing the burst transfer directly. This protocol represents a simplified master protocol. You can generate a fully compliant AXI-4 interface from this protocol using HDL Coder. For more information about the simplified AXI4 interface, see "Simplified AXI4 Master Interface" on page 1-28.

#### See Also

Memory Channel

#### **More About**

- "Simplified AXI4 Master Interface" on page 1-28
- "AXI4-Stream Interface" on page 1-25
- "AXI4-Stream Video Interface" on page 1-32

#### **Record Data from Hardware I/O Devices**

Models using recorded data in simulation can reproduce the behavior of the application when implemented onto a physical hardware or device. SoC Blockset provides a set of functions that can connect and record I/O device data directly from a hardware board. The recorded data file can then be used in an SoC Blockset model simulation.

#### **Process to Record Data**

To record I/O data from a hardware board, you can follow the general sequence of steps below.

- **1** Configure Hardware Connect and configure your hardware board. You may need to install the hardware support package for your hardware board.
- **2** Create Data Recorder A data recorder object manages the I/O hardware peripherals and stores the data during the data collection process.
- 3 Choose I/O Devices Choose from the available I/O devices on the hardware board and add them to the data recorder object.
- **4** Setup Recorder Prepare the hardware board for the data recording process. This setup includes any initialization and configuration of the hardware I/O devices to be recorded.
- 5 Start Recording Start the data recorder on the hardware. The data recorder executes and collects data from the hardware I/O devices for the specified period.
- 6 Execute Hardware Operations Run hardware operations on the hardware board that exercise the peripherals being recorded. Operations can include sending signals to an analog-to-digital converter or reading data received on a UDP channel.
- **7** Save Data Save the data stored in the data recorder to a file on your development computer.

The resulting data file can now be used in the simulation of the hardware blocks.

# Use Memory and I/O Device Data in Processor Simulation

The **Processor I/O** sub-library in SoC Blockset contains blocks that simulate the data transfer between the processor system and memory or I/O devices in the SoC application. **Processor I/O** blocks, including the Register Read, Register Write, and Stream Read, can read and write data to memory, such as DDR or hardware registers, on the SoC. Similarly, the TCP Read, TCP Write, UDP Read, and UDP Write blocks can read and write data to external I/O devices.

In simulation, an IO Data Source block sends data messages to the **Processor I/O** block. Together, this mechanism allows tasks to simulate using either previously recorded or generated I/O data with timing accurate execution.

The IO Data Source block and a **Processor I/O** block can be configured to simulate in one of three modes:

- Replay recorded data from file
- From input port
- Zeros

#### **Event-Driven Task**

For event-driven task, the recorded data The IO Data Source also sends event messages to the Task Manager block to start the task containing the **Processor I/O** block.

#### **Timer-Driven Task**

#### See Also

IO Data Source | Stream Read | TCP Read | Task Manager | UDP Read

# **Generate Code and Deploy on SoC Device**

- "Supported Third-Party Tools and Hardware" on page 3-2
- "Code Generation of Software Tasks" on page 3-4
- "Generate SoC Design" on page 3-5

# **Supported Third-Party Tools and Hardware**

#### **Third-Party Synthesis Tools and Version Support**

SoC Blockset supports these third-party FPGA synthesis tools:

- Intel® Quartus® Prime Standard Edition 18.1
- Xilinx Vivado® Design Suite 2018.3

To use third-party synthesis tools with SoC Blockset, a supported synthesis tool must be installed, and the synthesis tool executable must be on the system path.

#### **Third-Party Support for Software Generation**

SoC Blockset supports this third-party software generation tool:

• Intel SoC FPGA Embedded Development Suite (EDS) 18.0

#### **Supported Xilinx Devices**

SoC Blockset supports execution on Xilinx devices shown in this table.

| <b>Device Family</b>     | Board                                        |
|--------------------------|----------------------------------------------|
| Xilinx Artix®-7          | Artix-7 35T Arty FPGA Development Board      |
| Xilinx Kintex®-7         | Kintex-7 KC705                               |
| Xilinx Zynq              | Zynq-7000 ZC706                              |
|                          | ZedBoard™                                    |
| Xilinx Zynq UltraScale+™ | Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit |

# **Supported Intel Devices**

SoC Blockset supports execution on Intel devices shown in this table.

| <b>Device Family</b> | Board                        |
|----------------------|------------------------------|
| Intel Arria® 10      | Arria 10 SoC Development Kit |

| <b>Device Family</b> | Board                         |
|----------------------|-------------------------------|
| Intel Cyclone® V     | Cyclone V SoC Development Kit |

#### **SoC Board Support Packages**

The SoC Blockset support packages contain the definition files for all supported boards. You can download one or more vendor-specific support packages. To generate executables and execute on hardware, download at least one of these packages.

To see the list of SoC Blockset support packages, visit "SoC Blockset Supported Hardware" on page 6-2. To download an SoC Blockset support package, on the MATLAB **Home** tab, in the **Environment** section, click **Add-Ons > Get Hardware Support Packages**.

#### See Also

"Hardware Implementation Pane Overview" | SoC Builder

#### **More About**

- "Generate SoC Design" on page 3-5
- "SoC Blockset Supported Hardware" on page 6-2

#### **Code Generation of Software Tasks**

A Simulink model containing a Task Manager block simulates task execution. When a model gets deployed to an SoC hardware board, the SoC Blockset automatically creates and assigns the tasks to threads, links interrupts, messages, and system events to the generated code of the model.

#### **Timer-Driven Tasks**

An SoC Blockset model, when implemented onto hardware as generated and compiled code, uses an operating system (OS) timer to drive the base-rate time step of the model. All time based signals derive their time steps, known as sub-rates, from the base-rate time step of the model. A timer-driven task, created from the Task Manager block, uses a counter that increments at each base-rate timer step. When the counter reaches an integer multiple of the base-rate, the generated code posts to the semaphore associated with that task. Posting to the semaphore unblocks the thread and executes the task.

#### **Event-Driven Task**

Each event-driven task created from the Task Manager block gets a unique semaphore. A unique event elsewhere in the system posts to that semaphore and puts the task thread into the running state. OS kernel handles the management of the task thread until it returns to the waiting state.

#### See Also

SoC Builder | Task Manager

#### **More About**

- "Event-Driven Tasks" on page 1-4
- "Timer-Driven Task" on page 1-9

# **Generate SoC Design**

#### In this section...

"Step 1: Set Up FPGA Design Software Tools" on page 3-5

"Step 2: Start SoC Builder" on page 3-6

"Step 3: Prepare Model for Generation" on page 3-6

"Step 4: Select Project Folder" on page 3-8

"Step 5: Select Build Action" on page 3-8

"Step 6: Validate Model" on page 3-9

"Step 7: Build Model" on page 3-9

"Step 8: Connect Hardware" on page 3-9

"Step 9: Load and Run" on page 3-10

This tutorial outlines the steps to build hardware and software executables for your model and execute your application. Your SoC model can contain a processor model, an FPGA model, or both.

**SoC Builder** requires that you have a support package installed, based on the board selected in the configuration parameters. For more information, see "SoC Blockset Supported Hardware" on page 6-2.

## **Step 1: Set Up FPGA Design Software Tools**

To generate SoC binaries, you must include the path to Vivado or Quartus executables in your system path. If the executables are not already in your system path, use hdlsetuptoolpath function to add them to your path.

#### Xilinx Software

Use the hdlsetuptoolpath function to set up your system environment for accessing Xilinx tools from MATLAB. This function adds the specified installation folder to the MATLAB search path. The following example assumes that Xilinx Vivado is installed at C:\Xilinx\Vivado\2018.3\bin.

```
hdlsetuptoolpath('ToolName','Xilinx Vivado', ...
'ToolPath','C:\Xilinx\Vivado\2018.3\bin\vivado.bat')
```

#### **Intel Software**

Use the hdlsetuptoolpath function to set up your system environment for accessing Intel tools from MATLAB. This function adds the specified installation folder to the MATLAB search path. The following example assumes that Intel FPGA design software is installed at C:\Intel\18.1\quartus\bin64.

```
hdlsetuptoolpath('ToolName','Altera Quartus II', ...
'ToolPath','C:\Intel\18.1\quartus\bin64')
```

#### **Step 2: Start SoC Builder**

In the Simulink toolstrip, on the **System on Chip** tab click **Configure, Build & Deploy**.

#### **Step 3: Prepare Model for Generation**

Prepare your model by selecting a starting point for the build process, and then review the model information.

**Note** If no support package is detected, **SoC Builder** first prompts you to install the required support package.



Specify the starting point for the build process. If you are building a model that you have not built before, select **Build model**. If you previously completed the build process and saved the binaries in a folder, select **Load existing binaries**.

SoC Builder parses the model and displays the top model, the FPGA model (if one exists), and the ARM model (if one exists). Review this information for accuracy. If it seems incorrect, revise the model, save, and restart the **SoC Builder** tool.

**Note** If your FPGA model is set to a frame-based Simulink model variant, then the **SoC Builder** does not display the model in the table. To make it visible in the table, set the model variant to sample-based and recompile your design.

Click Next.

The next page of the **SoC Builder** provides information about the memory map of the model. To open the **Memory Mapper**, click **View/Edit**. Review the base addresses and offsets, and edit them if needed.

**Note** This memory map step of the **SoC Builder** is visible only if you have an FPGA model in your top model. If your FPGA model is set to frame-based modeling - then no FPGA model is visible, and therefore there is no access to the **Memory Mapper** tool.

Click Next.

#### **Step 4: Select Project Folder**

Specify a path to a project folder by entering the path in the **Project Folder** text box or by browsing to a folder location. The SoC Builder places all generated files, including reports, executables, and the bitstream, in this specified folder.

If you selected **Load existing binaries** as the starting point for the build process, specify the project folder location of the previous binaries and reports.

Click Next.

# **Step 5: Select Build Action**

In the **Select Build Action** section, select one of these options:

- Build, load and run Select this option to generate HDL and C code, build software executables and an FPGA programming file from your model. After building, SoC
   Builder loads the generated code to the FPGA board and executes the application.
- Build only Select this option to generate HDL and C code, build software
  executables and an FPGA programming file from your model. SoC Builder saves the
  generated binaries in a folder, and you can continue execution later.

• **Build and load for external mode** – Select this option to build the design and run it in external mode. External mode enables you to tune parameters on the FPGA without having to rebuild the FPGA design. It also enables logging data from the FPGA and displaying it on the host. For more information about external mode, see "Tune and Monitor Models Running on Target Hardware" (Simulink).

#### **Step 6: Validate Model**

Check the model against the selected board and generate a report. Check the report to ensure that the design is generated as expected.

#### **Step 7: Build Model**

To generate a bitstream for your FPGA design and a compiled executable for your software, click **Build**.

Clicking **Build** opens an external shell and runs third-party tools for synthesis and implementation of the design. The generation time depends on the complexity of your model and your host computer. Once the generation is complete, the bitstream is generated with your model name. **SoC Builder** generates a JTAG testbench script if you selected the **Include MATLAB as AXI Master** option in the configuration parameters. The script shows how to set up MATLAB as an AXI Master and configure your FPGA design over JTAG. You can customize the script to create your own test bench. For more information about MATLAB as an AXI Master, see support package documentation: "SoC Blockset Supported Hardware" on page 6-2.

## **Step 8: Connect Hardware**

Review the IPv4 address, SSH Port number, and login credentials. Edit any of these values if necessary. This step is critical if you have more than one board connected to the host computer, so that **SoC Builder** can identify the correct port connection. Verify that the displayed IP address matches the IP address for the board you intend to use.

Verify that the board is connected to the host with an Ethernet cable, and then click **Test Connection** to test the physical connection to the board.

**Note** This step in the **SoC Builder** is visible only if your top model includes a processor model.

#### Step 9: Load and Run

**Note** If your top model includes an FPGA model, but no processor model, the button shows as **Load**.

Verify that your board is connected to the host computer.

- If a processor model is present in your top model, connect to the board with an Ethernet cable.
- If the top model includes an FPGA model, but no processor model, connect to the board with a JTAG cable.

Click **Load and Run**. This action loads the generated bitstream to the FPGA, programs the processor, and runs the application.

If you selected **Tune parameters and monitor signals in external mode** in step 5, this action loads the bitstream to the FPGA and opens the model in external mode. You can now choose signals for logging and monitoring or change tunable parameters. In the **System on Chip** tab, in the **Run on Hardware** section, you can click **Monitor and Tune** to run the instrumented application on hardware. Click **Connect** if you previously built and loaded your design to an FPGA. This action connects your instrumented Simulink model to the FPGA model.

#### See Also

SoC Builder

# **Analyze Performance on SoC Device**

- "Code Instrumentation Profiler" on page 4-2
- "Kernel Instrumentation Profiler" on page 4-4
- "Profile Task Execution on Processor" on page 4-7
- "Memory Performance Information from FPGA Execution" on page 4-9

#### **Code Instrumentation Profiler**

In a code instrumentation profiler, code gets added into the generated code to record the start and stop times of each task executing on the processor. The recorded start and stop times of each task are sent to the development computer to be saved, processed, and displayed. The instantaneous state of each task gets inferred from the combined start and stop times and priorities of all the tasks within the process.

Consider a simple model with two tasks, one high- and low-priority executing on an embedded processor and measured by a code instrumentation profiler. This diagram shows the measurements made by the code instrumentation profiler and the inference on the individual task states resulting from these measurements.



Inspecting the diagram, it shows that the state of the low-priority task gets inferred from the higher-priority task's execution. Since only the start and end times of task execution get measured, some pertinent data can be lost, specifically kernel latency. As kernel latency precedes the start of the task, the actual time of the interrupt event is not directly observed and the start time of the task can assumed to be delayed from the actual time of the interrupt. Furthermore, when a task moves from the preempted to the running states, the kernel latency gets added into the interpreted execution time of the lower-priority task.

Code instrumentation profiling benefits from easy generation and deployment. On models deployed to processors with operating systems running a single process in a single tasking mode, task execution timing measurements be made with sufficient accuracy and precision. As only a minimal amount of code to record the start and stop times of the task

get added to each task, the impact of the task execution timing by the code instrumentation profiler, inmost cases, can be considered negligible.

#### Limitations

Code instrumentation profiling provides lightweight measurement tooling of generated code. However, two limitations must be considered when measuring the task execution and duration times using the code instrumentation profiler. These limitations are as follows:

- Cannot measure kernel latency or components of kernel latency. Kernel latency can
  generally be treated as a constant. As kernel latency impacts all task start up time
  with approximately equal effect, an estimate of the kernel latency could be deduced
  with comparisons to the task timings in simulation. For more information on kernel
  latency, see "Kernel Latency" on page 1-12
- Cannot capture the effect of commands issued to the OS kernel from within the task. The code instrumentation profiler records the start time, end time, and preemption of a task by other tasks. However, when the task makes a call to the OS kernel, the code instrumentation profiler does not record the change of control between the task and the kernel as a preemption. As kernel calls, without detailed knowledge of the timing, can be treated as non-deterministic, the measured task duration cannot be reliably measured using this type of profiler. For more information on task duration, see "Task Duration" on page 1-16.

#### See Also

Task Manager

#### **More About**

- "Kernel Instrumentation Profiler" on page 4-4
- "Kernel Latency" on page 1-12
- "Task Duration" on page 1-16

#### **Kernel Instrumentation Profiler**

A kernel instrumentation profiler uses a subset of the software tools and libraries included in the Linux® kernel for monitoring the actions made by the kernel to manage the execution of processes running on the SoC hardware. SoC Blockset uses LTTng, an open source tracing framework for Linux, as a kernel instrumentation profiler to monitor the execution of tasks and events of the Simulink model deployed on the SoC hardware. For more information, see the LTTng website.

Unlike a code instrumentation profiler, a kernel instrumentation profiler directly measures the conditions and changes in state for all tasks by monitoring the Linux OS kernel. This diagram shows the measurements made in a multitasking process with high and low priority tasks.



When a high priority task preempts a low priority task, the low priority task enters into the *Preempted* state and the high priority task enters into the *Running* state. After the high priority task completes execution, the scheduler resumes the preempted low priority task.

When using a kernel instrumenation profiler, the LTTng tracing framework traces the task state transitions directly from the Linux kernel, enhancing task execution time. In

comparison, when you use a code instrumentation profiler, it incorrectly includes the kernel latency in the execution time of the task.

Kernel instrumentation profiling provides these advantages.

- Operates in background without the need to instrument the code
- · High accuracy of timing measurements
- Knowledge of task execution and task state transtion directly from the kernel
- Exact knowledge of the processor core where the task executes

# Limitations

You can perform kernel instrumentation profiling only on SoC hardware that runs using a Linux OS.

# See Also

Task Manager

# **More About**

- "Code Instrumentation Profiler" on page 4-2
- "Kernel Latency" on page 1-12
- "Task Duration" on page 1-16

# **Profile Task Execution on Processor**

Measure task execution timing on an embedded processor using a code or kernel instrumentation profiler. Using external mode, a model with a Task Manager block executing on an SoC hardware board, the code or kernel instrumentation profiler stream the execution timing of the task to the host computer, displaying the data in the Simulation Data Inspector (SDI) in real time. You can also record the task execution data, which you can use with the Task Manager block to play back the task execution in simulation.

# **Task Profiling with External Mode**

- **1** Open the Simulink model to profile.
- 2 Select **Model Configuration Parameters > Hardware Implementation**, and then set **Hardware board** to an SoC Blockset supported hardware board.
- 3 In the Hardware board settings > Task profiling on processor section, enable Show in SDI.



- 4 Select Code or Kernel for code or kernel instrumentation profiling, respectively. Based on the profiling method you select, execution data will be collected from the processor and displayed on the SDI. For more information about the instrumentation methods, see "Kernel Instrumentation Profiler" on page 4-4 and "Code Instrumentation Profiler" on page 4-2.
- 5 Optional enable **Save to file** to log the measured task execution data into a file and save that file to <model>\_ert\_rtw/instrumented/diagnostics folder on your host PC. You can use this recorded data file in a Task Manager block to simulate the task execution timing on your model.

- 6 Click **Run** to deploy and execute the code on the hardware board. This step opens the **Simulation Data Inspector**, displaying the streaming data for the tasks and processors on which the tasks are executing.
- 7 Click **Stop** to end the model execution.

You can access and examine the logged data in the code generation folder used by the model. For more information on accessing the recorded streaming profiled data, see "Recording Tasks for Use in Simulation" on page 2-16. To use the recorded data in simulation, see "Task Execution Playback using Recorded Data" on page 2-7.

# See Also

Task Manager

#### **More About**

- "Recording Tasks for Use in Simulation" on page 2-16
- "Code Instrumentation Profiler" on page 4-2
- "Kernel Instrumentation Profiler" on page 4-4

# **Memory Performance Information from FPGA Execution**

#### In this section...

"Memory Performance Plots" on page 4-11

"Burst Waveforms" on page 4-18

"Configuring and Querying the AXI Interconnect Monitor" on page 4-19

Similar to the memory performance plots generated in simulation, you can collect memory interconnect traffic information from a design running on the FPGA. You can then generate similar performance plots. You can also capture the memory transaction information to view in the **Logic Analyzer** tool similar to the burst transactions from the memory controller in simulation. Use these plots to monitor real memory performance, debug and improve the design, and compare them against the memory performance obtained in simulation.

To include an AXI interconnect monitor (AIM) IP in your design, in the configuration parameters of the model, select the **Include AXI interconnect monitor** option under **Hardware Implementation** > **Target hardware resources** > **FPGA design (debug)**. The AXI interconnect monitor IP collects information from the design while it is running on the FPGA. You can query this information from MATLAB by using the JTAG connection. All memory masters in your FPGA are connected to the AXI interconnect monitor IP. These masters can include Memory Channel and Memory Traffic Generator blocks that you generated HDL code for or any other masters in your design.



The **SoC Builder** tool generates a JTAG test bench script for your design. The script collects the performance metrics from the AXI interconnect monitor and launches the performance plot application, which plots the memory performance plots for bandwidth, number of bursts, and transaction latencies. These plots are similar to the plots of memory performance in simulation. You can also modify the script to collect and display memory transaction waveforms similar to the burst waveforms of memory controller in simulation. For information on the simulation memory performance, see "Simulation Performance Plots" on page 2-34.

For an example, see "Analyze Memory Bandwidth Using Traffic Generators" on page 5-53, which shows how to monitor memory performance in both simulation and when running on the FPGA. The script generated by the **SoC Builder** tool uses the JTAG connection to enable any traffic generators in your design, and then samples the memory

performance information from the AXI interconnect monitor IP as fast as it can. The sampling interval depends on the JTAG latency, which is typically from 10 ms to 20 ms. The script then displays plots similar to the performance plots from the Memory Controller block in your simulation. The plot displays the bandwidth, number of bursts, and transaction latency for each master.

**Note** The AXI master itself is not connected to the AXI interconnect monitor. Therefore, the hardware diagnostics do not include the memory usage plots for test-bench-only masters that initialize the memory with predetermined data.

# **Memory Performance Plots**

The script collects the performance metrics from the AXI interconnect monitor and launches the performance plot application.

#### **Memory Bandwidth Plots**

In the **Bandwidth** tab, select the masters for which you want to graph bandwidth. Click **Create Plot** to see the bandwidth, in megabytes per second, for the selected masters over the duration of the run time. This figure shows the bandwidth for the "Analyze Memory Bandwidth Using Traffic Generators" on page 5-53 example.



# **Memory Burst Plots**

In the **Bursts** tab, select the masters for which you want to graph bursts. Click **Create Plot** to see the number of bursts executed for the selected master over the duration of the run time. This figure shows the burst count for the "Analyze Memory Bandwidth Using Traffic Generators" on page 5-53 example.



# **Memory Latency Plots**

In the **Latencies** tab, select the master for which you want to graph latencies. Click **Create Plot** to see the latency, for the selected masters over the duration of the run time. This image shows the total latency for Master 1 in the "Analyze Memory Bandwidth Using Traffic Generators" on page 5-53 example. You can then zoom in to analyze the peak instantaneous latency.



You can choose from any of these options:

- **Burst request to first transfer complete** This option shows the time it takes from the moment the master issues a transaction request to the first transfer of data. This latency accounts for arbitration or interconnect delays.
- Burst execution This option shows the time it takes from the first transfer of data to the burst last transfer.
- **Burst last transfer to complete** This option shows the time it takes from last transfer to complete transaction. In case of read transaction, it is 0.
- Instantaneous Total Latency This option shows discrete total latency measurements per burst.

Each latency value plotted is an average of the respective latency, measured from the memory transactions over a sampling interval. The following figure shows an AXI4 Master protocol write and read transaction on the hardware showing each of these latencies.

#### Write Transaction



Read Transaction



In read transaction, **Burst last transfer to complete latency** is zero.

#### **Data Overflow**

In Profile mode, the collectMemoryStatistics function samples memory metrics: bandwidth, burst, and latencies values from the hardware after every sample. After that, the function resets the metric counters and then starts the counters again for the next sample. If any of the metric counters exceeds the limit of  $2^{32}-1$  within the sampling interval, the counter is overflowed and the corresponding sample is indicated with \* in the plot.

# **Burst Waveforms**

You can also modify the generated script to configure the AXI interconnect monitor to collect event data for each burst transaction. You can view these events in the **Logic Analyzer** waveform viewer to examine arbitration behavior. Specify the number of transactions to capture, **Trace capture depth**, in the configuration parameters of the model, under **Hardware Implementation** > **Target hardware resources** > **FPGA design (debug)**.



The waveforms show the event type (BurstIdle, BurstRequest, BurstExecuting, or BurstDone) and these parameters of the burst transaction:

- MasterID -- ID number of the memory master that made the request
- DataWidth -- Data width in bits
- BurstLength -- Number of data words in the burst request
- BurstsTransferred -- Number of bursts in this request (valid only with BurstDone event)
- BytesTransferred -- Number of bytes in this request (valid only with BurstDone event)

You can compare these waveforms with the waveforms captured from your Memory Controller block in simulation.

# Configuring and Querying the AXI Interconnect Monitor

The AXI interconnect monitor (AIM) is an IP core that collects performance metrics for an AXI-based FPGA design. Create an socIPCore object to set up and configure the AIM IP, and use the socMemoryProfiler object to retrieve and display the data.

For an example of how to configure and query the AIM IP in your design using MATLAB as AXI Master, see "Analyze Memory Bandwidth Using Traffic Generators" on page 5-53. Specifically, review the soc\_memory\_traffic\_generator\_axi\_master.m script that configures and monitors the design on the device.

#### **Select Memory Monitor Mode**

The AXI interconnect monitor can collect two types of data. Choose Profile mode to collect average transaction latency, and counts of bytes and bursts. In this mode, you can open a performance plot tool, and then configure the tool to plot bandwidth, burst count, and transaction latency. Choose Trace mode to collect detailed memory transaction event data and view the data as waveforms.

```
perfMonMode = 'Profile'; % or 'Trace'
```

#### **Configure the AXI Interconnect Monitor**

To obtain diagnostic performance metrics from your generated FPGA design, you must set up a JTAG connection to the device from MATLAB. Load a .mat file that contains structures derived from the board configuration parameters. This file was generated by the **SoC Builder** tool. These structures describe the memory interconnect and masters configuration such as buffer sizes and addresses. Use the <code>socHardwareBoard</code> object to set up the JTAG connection.

```
load('soc_memory_traffic_generator_zc706_aximaster.mat');
hw0bj = socHardwareBoard('Xilinx Zynq ZC706 evaluation kit','Connect',false);
AXIMasterObj = socAXIMaster(hw0bj);
```

Configure the AIM. The socIPCore object provides a function that performs this initialization. Next, set up a socMemoryProfiler object to gather the metrics.

```
apmCoreObj = socIPCore(AXIMasterObj,perf_mon,'PerformanceMonitor','Mode',perfMonMode);
initialize(apmCoreObj);
profilerObj = socMemoryProfiler(hwObj,apmCoreObj);
```

#### **Retrieve Diagnostic Data**

To retrieve performance metrics or signal data from a design running on the FPGA, use the socMemoryProfiler object functions.

For Profile mode, call the collectMemoryStatistics function in a loop.

```
NumRuns = 100;
for n = 1:NumRuns
```

```
collectMemoryStatistics(profilerObj);
end
```

JTAG design setup time is long relative to FPGA transaction times, and if you have a small number of transactions in your design, they can be completed by the time you query the monitor. In this case, the bandwidth plot shows only one sample, and the throughput calculation is not accurate. If this situation occurs, increase the total number of transactions the design executes.

For Trace mode, call the collectMemoryStatistics function once. This function stops the IP from writing transactions into the FIFO in the AXI interconnect monitor IP, although the transactions continue on the interconnect. Set the size of the transaction FIFO, Trace capture depth, in the configuration parameters of the model, under Hardware Implementation > Target hardware resources > FPGA design (debug).

collectMemoryStatistics(profilerObj);

#### **Visualizing Performance Metrics**

Visualize the performance data using the plotMemoryStatistics function. In Profile mode, this function opens a performance plot tool, and you can configure the tool to plot bandwidth, burst count, and average transaction latency. In Trace mode, this function opens the **Logic Analyzer** tool to view burst transaction event data.

plotMemoryStatistics(profilerObj);

# See Also

Memory Controller | collectMemoryStatistics | plotMemoryStatistics |
socMemoryProfiler

#### **More About**

- "Simulation Diagnostics" on page 2-34
- "Analyze Memory Bandwidth Using Traffic Generators" on page 5-53

# See Also

# **Examples**

# **Random Access of External Memory**

This example shows how to model external memory accesses from FPGA for rotating an ASCII art image. Many applications require FPGA to access memory in random fashion as per the requirements of algorithm. You will learn how to design memory address generation along with other AXI4 master signals to read and write specific regions of memory using SoC Blockset. You will simulate, implement and verify your design on hardware.

Supported hardware platforms:

- Artix® 7 35T Arty FPGA evaluation kit
- Xilinx® Kintex® 7 KC705 development board
- Xilinx Zynq® ZC706 evaluation kit
- Xilinx Zynq UltraScale<sup>™</sup> + MPSoC ZCU102 Evaluation Kit
- Altera® Cyclone® V SoC development kit
- Altera Arria® 10 SoC development kit

## **Design Task**

The ASCII art image is encoded as 24-by-64 matrix of uint8 characters. The design task is to rotate the image by modeling AXI4 Master interfaces in FPGA logic for external memory access. By simulating the design with external memory model and the AXI4 protocol, you verify the behavior at application design time. This saves time otherwise spent in debugging the design on hardware during the implementation phase.

The overall dataflow is as described in figure below. The image is stored in the external memory at the memory region from address 0x00000000 to 0x000017FF. FPGA algorithm reads the image from this region and rotates it by writing it in the reverse order into the memory region from 0x00001800. Finally, the data is read back from the memory.



#### **Model Structure**

The models are structured using Model references. Top model ' $soc_image_rotation$ ' includes the FPGA model ' $soc_image_rotation_fpga$ ' using Model block as model reference.

# Memory Controller DDR Testbench Input FDGA Memory Region Testbench Output DDR 仈 MT@trIO 仈 wrCtrlOu rdCtrlO AVIA Mont HW From dialog boolean(1) Copyright 2019 The MathWorks, Inc.

#### Random Access of External Memory

The top model covers the following areas:

- Testbench Input: It models the stimuli to set up the design for simulation. The AXI4
   Master Source block initializes the input image data to the external memory. The
   Algorithm Start block sends a Start signal to the FPGA algorithm via Register
   Channel block. Open preload function soc\_image\_rotation\_init.m to see how model
   parameters and input data are initialized.
- **Testbench Output**: The AXI4 Master Sink block models the reading of the output image data from the external memory. The output data is saved in the variable AXI4MasterSinkContent in the workspace. Open stop function soc\_image\_rotation\_post.m to see how input data and output data are plotted.
- Memory: Memory system is modeled using one Memory Controller and two Memory Channel blocks. Input Read Memory Channel block models memory region 1 where input image is stored and Output Write Memory Channel block models memory region 2 where the rotated image is stored.
- **FPGA**: This area instantiates the FPGA model reference which models the logic for AXI4 Master interfaces and data rotation.



FPGA model implements the algorithm in three subsystems, AXI4MasterRead, ReverseDataOrder and AXI4MasterWrite. Open FPGA subsystem for image rotation:

As the positive edge of start signal is detected, AXIMasterRead reads one line of image data and deliver it to ReverseDataOrder for reverses the order of data. The reversed data is then written to external memory by AXIMasterWrite subsystem. Once the data for one line is written, it sends a signal request\_next\_line to trigger reading of next line by AXIMasterRead. This cycle continues until all lines of the image are processed.

Open AXI4MasterReadController and AXI4MasterWriteController blocks to inspect the MATLAB® code for AXI4 Master interfaces. These blocks design the addressing logic for read and write operations as per AXI4 protocol. SoC Blockset supports AXI4 Master protocol and for timing diagrams of AXI4 signals, please refer to Model Design for AXI4 Master Interface Generation.

#### **Simulation**

Run the model and open the Logic analyzer from the FPGA model. Notice the following key points:

- One line of data is written/read by masters in one burst. Since each line is 64 characters long; the burst length is 64 (0x40). Note this value on signals rd\_len and wr\_len.
- Each character has 4 bytes as it is extended to uint32 data type, which makes the length of line 64x4 = 256 (0x100) bytes. Therefore, addresses increment/decrement by 0x100. Note this on rd\_addr and wr\_addr signals.

- One read burst is followed by one write burst. Observe how rd\_dvalid and wr dvalid toggle alternatively.
- request\_next\_line asserts after each write burst, which trigger the next read burst.



The input and output images are plotted at the end of simulation:

| ■ Input ASCII ART                                                  | _ | × |
|--------------------------------------------------------------------|---|---|
|                                                                    |   |   |
|                                                                    |   |   |
|                                                                    |   |   |
|                                                                    |   |   |
| ; f::<br>; h2mm2;                                                  |   |   |
| iGS;SX5s;<br>2GS;:hXSs;<br>,AGS::,iOX5SH:<br>-ASS::,iOX5A          |   |   |
| .xx38: .i9925i3c<br>r32x35: _r830x5iscs<br>-r3325x00: _rex0x32s22a |   |   |
| ; x222523x : ; rs23h325ii00                                        |   |   |
|                                                                    |   |   |
| e\$2555a;<br>rX5ar,                                                |   |   |
|                                                                    |   |   |
|                                                                    |   |   |
|                                                                    |   |   |
|                                                                    |   |   |



#### **Implementation**

Following products are required for this section:

- HDL Coder<sup>™</sup>
- SoC Blockset Support Package for Xilinx Devices, or
- SoC Blockset Support Package for Intel® Devices

To implement the model on a supported FPGA board, use the SoC Builder application. Make sure you have installed required products and FPGA vendor software before implementation.

Open SoC Builder by clicking 'Configure, Build, & Deploy' button in the toolstrip and follow these steps:

- Select 'Build Model' on 'Setup' screen. Click 'Next'.
- Click 'View/Edit Memory Map' to view the memory map on 'Review Memory Map' screen. Notice that the base address 0x000000000 is assigned to Input Read Memory Channel block, and base address 0x00001800 is assigned to Output Write Memory Channel block. The AXI4 address is the sum of base address and address from FPGA algorithm. For example, wr\_addr from FPGA algorithm starts with 0x1700. The output data will be written to the external memory from address 0x00001800 + 0x1700 = 0x00002F00. Refer to Model Design for AXI4 Master Interface Generation for more information about base address register calculation. Click 'Next'.
- Specify project folder on 'Select Project Folder' screen. Click 'Next'.
- Select 'Build, load and run' on 'Select Build Action' screen. Click 'Next'.
- Click 'Validate' to check the compatibility of model for implementation on 'Validate Model' screen. Click 'Next'.
- Click 'Build' to begin building of the model on 'Build Model' screen. An external shell will open when FPGA synthesis begins. Click 'Next' to 'Load Bitstream' screen.

The FPGA synthesis may take more than 30 minutes to complete. To save time, you may want to use the provided pre-generated bitstream by following these steps:

- Close the external shell to terminate synthesis.
- Copy pre-generated bitstream to your project folder by running the command below and then.
- Click 'Load' button to load pre-generated bitstream.

```
copyfile(fullfile(matlabroot, 'toolbox', 'soc', 'socexamples', 'bitstreams', 'soc_image_rote
```

To run this example, copy the example test bench to your project folder.

```
copyfile(fullfile(matlabroot, 'toolbox', 'soc', 'socexamples', 'soc_image_rotation_aximasto
```

Enter the following command to run the test bench:

```
soc_image_rotation_aximaster
```

The test bench performs the following operations:

• Initializes image rotation IP

- Writes input image data to external memory
- Starts the image rotation operation
- Reads back and display output image data from external memory

If your FPGA board is not Xilinx Zynq ZC706 evaluation kit you need to do the following settings in the configuration parameters of the top model before launching the SoC Builder.

- Select the 'Hardware board' under 'Hardware Implementation' panel to match your board.
- Uncheck 'Include processing system' under 'Hardware Implementation -> Target hardware resources -> FPGA design (top-level)' panel.
- Set 'Interconnect data width (bits)' to '32' under 'Hardware Implementation -> Target hardware resources -> FPGA design (mem channels)' panel.

Available pre-generated bitstreams are:

- 'soc image rotation fpga-zc706.bit'
- · 'soc image rotation fpga-arty.bit'
- 'soc\_image\_rotation\_fpga-kc705.bit'
- 'soc\_image\_rotation\_fpga-a10soc.sof'

Modify the *copyfile* command and example test bench to match your board and selected project folder as appropriate. Note that pre-generated bitstream may not work if you customized the memory map.

#### Conclusion

This example shows modeling of AXI4 Master interfaces for accessing external memory in random fashion using SoC Blockset by rotating an ASCII art image. You can use this as a guide to design your own algorithm to access memory directly using AXI4 Master protocol.

# Packet-Based ADS-B Transceiver

Packet-based systems are common in wireless communications. Data is received over the air and is decoded as discrete packet data on a compute device. For given system requirements, it is difficult to design a system and implement directly on SoC as it often involves long iterations of debugging and integration on hardware since hardware effects are difficult to account for at design time. In this example, you will design packet-based airplane tracking application based on Automatic Dependent Surveillance Broadcast (ADS-B) standard, partitioned between FPGA and embedded processor. Unlike traditional methods, you will simulate the application design with memory interface before implementation on hardware using SoC Blockset to shorten development time. You will then validate the design on hardware by automatically generated code from the model.

#### Supported Hardware Platforms:

- Xilinx® Zynq® ZC706 evaluation kit + Analog Devices® FMCOMMS2/3/4 card.
- ZedBoard<sup>™</sup> + Analog Devices FMCOMMS2/3/4 card.

#### **Design Task and System Requirements**

As per ADS-B standard a message packet contains a total of 120 bits which has an 8 bit preamble and 112 bits of information about the aircraft including its position and velocity. For an introduction to the Mode-S signaling scheme and ADS-B technology for tracking aircraft, refer to the 'Airplane Tracking Using MATLAB®' example in Communications Toolbox.

Our task is to design a system to receive ADS-B messages off the air and decode with following performance requirements:

- Latency: 0.5 seconds
- Drop sample rate: < 1 in 105 messages
- Throughput: 0.125 MBps (for capacity of maximum 300 aircrafts)

#### **Design Using SoC Blockset**

**Design Parameters**: Data is transferred from FPGA to processor across shared memory as a frame of samples. There are two key design parameters, **Frame Size** and **Number of Buffers** which affect the above performance requirements.

• Frame Size: Frame Size is the number of samples in a frame.

• **Number of Buffers**: Number of frame buffers in memory channel. Data is continuously written into memory by FPGA algorithm as frame buffers which are then read by processor to execute its identification algorithm task.

Select the design parameters to satisfy the system requirements as follows:

**Design to Meet Latency Requirement**: Latency is the time period between when the data is received by the FPGA logic and the data is ready to be processed by the processor. It comprises of two parts, latency through the FPGA logic and the latency for the processor to be available to process data.

Latency through the FPGA logic is the time required for data processing through the FPGA. This is typically on the order of a number of clock cycles with the clock running in MHz range. Latency for the processor to be available to process data, is determined by the time it takes for samples to transfer from FPGA to processor through FIFO and memory frame buffers. If we size FPGA FIFO equivalent to one frame buffer, then the maximum latency can be written as follows:

```
MaxLatency = (NumberOfBuffers + 1) * (TimeToGatherAFrame)
```

As the Time to gather a frame is directly proportional to Frame Size, therefore, the maximum latency in the data transfer is directly proportional to Frame Size and Number of buffers.

Time to gather a frame is a constant for continuously streaming applications and is equal to Frame Size times the FPGA output sample time. However, for asynchronous packet-based systems, this time also depends on the frequency of arrival of packets. If you choose a Frame size larger than the packet size, then you may have to wait for an indeterminate time for arrival of all the packets required to make a frame. If you choose the packet size smaller than packet size, then it will adversely affect the throughput. Therefore, for asynchronous packet based systems, Frame Size equal to packet size is a reasonable choice. This allows each packet to transfer to processor as soon as the FPGA processing is completed, thereby reducing the latency.

For this example, the decoded packet length is 112 bits, packed into four 32-bit samples. So, the frame size is 4.

**Design to Meet Throughput Requirement**: Throughput is the amount of data produced as output per unit of time. This is a function of the data processing in FPGA and the data transfer & processing by processor. For FPGA logic, the data is processed at clock frequencies of the order of MHz and an output is produced every few clock cycles.

For data transfer and processing by processor, it depends on Frame Size. A typical tradeoff is larger Frame size results in higher throughput but it increases the latency. Conversely, a smaller frame size results in lower latency but it decreases the throughput.

**Design to Meet Drop Samples Requirement**: An application may tolerate occasional drop data caused by the variations in task execution durations. Frame buffers in a memory channel hold data when it can't be immediately processed by the processor. Therefore, increasing the number of Frame buffers reduces the sample drop-outs but it adversely affects the latency as explained earlier.

Choose the Number of Buffers value such that you are able to meet the Drop samples requirement without affecting the maximum latency requirement.

For this example, the mean task duration, as measured on ZC706 is 114us. Each packet duration is 120us. Even if the packets arrive back to back, they can be processed with minimal number of frame buffers since on average the task is processed before the new packet arrives. So, set the number of frame buffers to the minimum possible, 3.

**Create an SoC Model**: Use the "SDR Template" on page 1-53 for creating an SoC model for wireless communications applications.

# External I/O PPGA Memory Me

#### Packet-Based ADS-B Transceiver

Copyright 2019 The MathWorks, Inc.

The top model is depicted with bounding boxes that segment the model as follows:

- External I/O: This part of the model contains the AD9361 RF Input and Output blocks
  which are connected to each other using a simplified channel model. In addition this
  region has LED blocks that connect the FPGA logic.
- **FPGA**: The FPGA section of the model contains the FPGA algorithms which are designed in a separate model and instantiated here using model reference.
- **Memory**: This section models the memory channel between FPGA and processor. It simulates the latencies in the HW/SW connection.
- Register Channel: This section models three FPGA registers that are configured by the processor.
- **Processor**: This section contains the Task Manager that is connected to processor model. The Task Manager controls the scheduling of processor tasks. The processor algorithm and initialization tasks are modeled in a separate model and is instantiated here using model references.

FPGA model contains the ADS-B Transmitter Algorithm that transmits test ADS-B packets at a variable rate and the ADS-B Receiver Algorithm that decodes received ADS-B messages.



The processor model contains *Processor Algorithm* that unpacks the received ADS-B packets into information bits and sends them via UDP Send block to another system for reporting the aircraft information. The processor algorithm task is denoted as *dataTask* in

the Task Manager block and is specified as event-driven. The Task Manager schedules data asynchronously by means of a buffer ready event *rdEvent* in the memory channel.



The *Initialize Function* subsystem initializes appropriate hardware configuration registers. The AD9361 blocks set the center frequency, gain mode, and baseband sample rate of the attached FMC RF board. The other blocks model three memory mapped configurations of the ADS-B packet detector datapath. These include selection of input to receiver algorithm, transmit period of test packets from FPGA and threshold value for detection algorithm.

The model  $soc\_ADSB\_UDP\_HostPrintout$  is a host UDP-based receive model that decodes ADS-B messages. Run this model in parallel to the ADSB simulation or deployment model to display the decoded ADS-B messages and also optionally map the aircraft location.

# Host Model for Receiving ADS-B Messages



Copyright 2019 The MathWorks, Inc.

#### **Simulate**

Run the model to visualize data transfer between the FPGA and the processor. The time period between the arrival of packets is a function of number of aircrafts. Given system requirement of detecting 300 aircrafts, there will be on average 300\*6.2 = 1860 messages per second (or a message every 1/1860 = 0.54 ms). You can set the number of aircrafts using the variable *NumAircraft* which in turn sets the period in the *Initialize Function* subsystem. The default setting is 300 to match the allowable system capacity.

Open the Logic Analyzer window to see the waveforms, and notice that the memory transfers are taking place in buffers of 4 samples, or 16 bytes.



To view the external memory bandwidth usage, open the  $Mem\ Controller$  block, select the  $Performance\ tab$  and click  $Launch\ performance\ app$ . Select all the masters and click  $Create\ Plot$ . The plot shows the bandwidth of 0.125 MBps. Since 4 bytes of data is transferred every 32us, the expected bandwidth is 4/32e-6=0.125 MBps.



Using the Simulation Data Inspector, you can visualize the task execution schedule. The data task is driven by the event from FPGA notifying the processor that a packet has been decoded by the FPGA, written to external memory, and read by the DMA driver.



To see the decoded messages, run the companion UDP receive model. This model will display the aircraft tracking information on a GUI.



# **Hardware Requirements Analysis**

As discussed earlier, since mean task duration of 114us is less than the packet duration of 120us, the messages are not dropped on average, during the transfer to the processor. This is confirmed by looking at the number of dropped samples at FIFO using signal *icFIFODroppedCount* in the Simulation Data Inspector.



The SoC model can be used to explore the design space. Consider the worst-case scenario when the plane messages are received densely and there is more computation load on the processor. You can modify the model settings and simulate and determine whether packets are dropped in this more aggressive scenario.

Set the *NumAircraft* to 990 (a new message every 163us) to simulate back to back arrival of plane messages. Modify the task specification on the Task Manager block to simulate more computation load on processor. On the Simulation tab, choose the second distribution by setting the Percent value to 100% on second row and 0% on the first row. This assigns a mean task duration of 163us, which will result in some task executions taking longer than allowed. Set the simulation time to  $0.1 \, \text{ms}$  and simulate. For 990 planes, the messages arrival rate is  $990*6.2 = 6138 \, \text{messages}$  per second. The drop packet requirement is therefore,  $6138/105 = 58 \, \text{messages}$  per second or  $5.8 \, \text{messages}$  in  $0.1 \, \text{sec}$ . Upon simulation notice in the Logic Analyzer that this requirement is violated as  $18 \, \text{messages}$  have been dropped.

# Implement and Run on Hardware

Following products are required for this section:

- HDL Coder™
- Embedded Coder®
- "SoC Blockset Supported Hardware" on page 6-2

To implement the model on a supported SoC board use the SoC Builder tool. By default, the model will be implemented on **Xilinx® Zynq® ZC706 evaluation kit** as it is configured with that board. To open SoC Builder, select the 'System on Chip' tab in the Simulink toolstrip, and click the 'Configure, Build, & Deploy' button. Once SoC Builder opens, follow these steps:

- Select 'Build Model' on 'Setup' screen. Click 'Next'.
- Click 'View/Edit Memory Map' to view the memory map on 'Review Memory Map' screen. Click 'Next'.
- Specify project folder on 'Select Project Folder' screen. Click 'Next'.
- Select 'Build, load and run' on 'Select Build Action' screen. Click 'Next'.
- Click 'Validate' to check the compatibility of model for implementation on 'Validate Model' screen. Click 'Next'.
- Click 'Build' to begin building of the model on 'Build Model' screen. An external shell will open when FPGA synthesis begins. Click 'Next'.
- Click 'Test Connection' on 'Connect Hardware' screen to test the connectivity of host computer with SoC board. Click 'Next' to go to 'Run Application' screen.

The FPGA synthesis may take more than 30 minutes to complete. To save time, you may want to use the provided pre-generated bitstream by following these steps:

- Close the external shell to terminate synthesis.
- Copy pre-generated bitstream to your project folder by running the command below and then,
- Click 'Load and Run' button to load pre-generated bitstream and run the model on SoC board

copyfile(fullfile(matlabroot, 'toolbox', 'soc', 'socexamples', 'bitstreams', 'soc\_ADSB\_fpga

**Implementation on ZedBoard**: To implement the model on ZedBoard, you must first configure the model to ZedBoard and set the following example parameters. Open **Model Configuration Parameters**, navigate to **Hardware Implementation** tab and perform the following:

- Select ZedBoard from the drop-down list under 'Hardware board' on both top and processor model.
- Navigate to Target hardware resources > FPGA design (top level) tab, enable Include MATLAB as AXI Master IP for host-based interaction and set IP core clock frequency (MHz) to 4 MHz.

- Navigate to Target hardware resources > FPGA design (debug) tab and enable Include AXI Interconnect monitor.
- Navigate to Device details and select Support long long.

Next, open SoC Builder and follow the steps as previously stated for Xilinx® Zynq® ZC706 above. Modify the *copyfile* command to match Zedboard bitstream 'soc ADSB fpga-zedboard.bit'.

## **Profiling Results**

To enable processor task profiling, open configuration parameters and navigate to **Hardware Implementation > Hardware Board settings > Task Profiling on processor** and select 'Show on SDI' and 'Save to file'. Set the Simulation stop time to 10 seconds and run the model in external mode. After simulation is completed, open Simulation Data Inspector (SDI) and navigate to the latest run and add signal <code>DataReadTask</code> to the plot. Observe that the simulation model accurately predicted how the application would perform on hardware.



## **Summary**

This example showed how SoC Blockset is used to design packet-based ADS-B standard to meet system requirements. By simulating the design with memory channel as interface between the FPGA and the Processor you validated that the system requirements of throughput and drop packets are met at the design time. You implemented the design on SoC device from the model and verified the results on hardware. Although ADS-B is not a computationally intensive standard, it is useful to demonstrate the design process for

packet-based systems intended for implementation on a SoC device. You can follow the same design procedure for even more computationally intensive requirements for this application or another packet-based application.

# **Histogram Equalization Using Video Frame Buffer**

Video processing applications often store a full frame of video data to process the frame and modify the next frame. In such designs video frames are stored in external memory while FPGA resources are used to process same data. This example shows how to design a video application with HDMI input and output performing histogram equalization using external memory for video frame buffering.

Supported hardware platform

• Xilinx® Zynq® ZC706 evaluation kit + FMC-HDMI-CAM mezzanine card

### **Design Task and System Requirements**

Consider an application involving continuous streaming of video data through the FPGA. In the top model soc\_histogram\_equalization\_top the FPGA calculates the histogram of the incoming video stream, in the 'FPGA' subsystem, while streaming the same video stream to external memory for storage. Once the histogram has been calculated and accumulated across the entire video frame, a synchronization signal is toggled to trigger the read back of the stored frame from external memory. The accumulated histogram vector is then applied to the video stream read back from external memory to perform the equalization algorithm. The external memory frame buffer is modeled using the 'Memory Channel' block in AXI4-Stream Video Frame Buffer mode.

The 'HDMI Input' block reads a video file and provides video data and control signals to downstream FPGA processing blocks. Video data is in YCbCr 4:2:2 format, and the control signals are in the pixel control bus format. The 'HDMI Output' block reads video data and control signals, in the same format as output by the 'HDMI Input' block, and provides a visual output using the Video Display block.

The Push Button block enables bypassing of the histogram equalization algorithm, routing the unprocessed output from the external memory frame buffer to the output.

# Input | Figure based processing | Indicated | Indicat

### Histogram Equalization Using Video Frame Buffer

There are a number of requirements to consider when designing an application that interfaces with external memory:

- **Throughput**: What is the rate that you need to transfer data to/from memory to satisfy the requirements of your algorithm? Specifically for vision applications, what is the frame-size and frame-rate that you must be able to maintain?
- Latency: What is the maximum amount of time that your algorithm can tolerate between requesting and receiving data? For vision applications, do you need a continuous stream of data, without gaps? Are you able to buffer samples internal to your algorithm in order to prevent data loss when access to the memory is blocked?

For this histogram equalization example, we have defined the following requirements:

- Throughput must be sufficient to maintain a 1920x1080p video stream at 60 framesper-second.
- Latency must be sufficiently low so as not to drop frames.

With the above throughput requirement, we can calculate the value that is required for the frame buffer:

$$1920 \times 1080 \times 60 = 124.416 \,\mathrm{Msps}$$

As the video format is YCbCr 4:2:2, we require 2 bytes-per-pixel (BPP), this equates to a throughput requirement of

$$2 \times 124.416 = 248.832 \,\mathrm{MB/s}$$

Because the algorithm must both write and read the video data to/from the external memory, this throughput requirement must be doubled, for a total throughput requirement of

$$2 \times 248.832 = 497.664 \,\mathrm{MB/s}$$

## **Design Using SoC Blockset**

In general, your algorithm will be a part of a larger SoC application. In such applications, it is likely that there will be other algorithms also requiring access to external memory. In this scenario, you must consider the impact of other algorithm's memory accesses on the performance and requirements of your algorithm. Assuming that your algorithm shares the memory channel with other components, you should consider the following:

- What is the total available memory bandwidth in the SoC system?
- How will your algorithm adapt to shared memory bandwidth?
- Can your algorithm tolerate an increased read/write latency?

By appropriate modeling of additional memory consumers in the overall application, you can systematically design your algorithm to meet your requirements in situations where access to the memory is not exclusive to your algorithm.

To avoid modeling of all memory readers and writers in the overall system, you can use 'Memory Traffic Generator' blocks to consume read/write bandwidth in your system by creating access requests. In this way, you can simulate additional memory accesses within your system without explicit modeling.

# **Modeling Additional Memory Consumers**

When implemented on hardware, the HDMI output requires an additional frame buffer for synchronization of the video stream data between clock-domains, and introduces an additional memory consumer in the overall system. You can model this using Memory Traffic Generator blocks to simulate the additional memory consumption. As we are modeling both read and write transactions, we will use two Memory Traffic Generator blocks - one each for read and write.

Based on the throughput calculation for our 1080p video stream, we know that the additional frame buffer will require  $^{497.664}$  MB/s of bandwidth for simultaneous read and write access.

The write transactions are modeled by HDMI Buffer Write and the read transactions are modeled by HDMI Buffer Read. The block mask for both are shown below.



The total burst requests are configured as *inf*, as we want to simulate a continuous stream of data to/from the memory. This will ensure that the traffic generator block will continue to issue transaction requests for the entirety of the simulation.

The burst size is specified as 192, which is the 1/10th of pixels per line. As the burst size is specified in bytes, this is equivalent to one tenth of a single line of a single component

of the output video stream, i.e. a single line of the Y-component of the YCbCr 4:2:2 video stream.

The time between burst is specified as 1/1296000. This can be expanded as

$$\frac{192}{1080 \times 1920 \times 60 \times 2}$$

where,

192 is the number of bytes per burst,

1080 is the number of lines in the video stream,

1920 is the number of pixels per line in the video stream,

60 is the number of frames-per-second and,

2 is the number of components in our video stream.

Putting the above parameters together, we can calculate our requested throughput as follows:

$$192 \times 1296000 = 248.832 \,\mathrm{MB/s}$$

And, as we have two traffic generators to simulate both read and write transactions, the total bandwidth consumption will be  $^{497.664\,\mathrm{MB/s}}$ 

Simulating the system with the above settings results in the following Memory Bandwidth Usage plot.



Here, the memory masters are as follows:

- 1 Master 1: Frame Buffer write
- 2 Master 2: Frame Buffer read

- **3** Master 3: HDMI Buffer Write (Memory Traffic Generator)
- 4 Master 4: HDMI Buffer Read (Memory Traffic Generator
- **5** Master 5: Contention (Memory Traffic Generator) (commented out)

You can see that all 4 active masters are consuming 248.8 MB/s of memory bandwidth.

**More Memory Consumers:** Consider that your algorithm is part of a larger system, and a secondary algorithm is being developed by a colleague or third-party. In this scenario, the secondary algorithm will be developed separately for the interest of time and division of work. Rather than combine the two algorithms into a single simulation, you can model the memory access of the secondary algorithm using a Memory Traffic Generator, and simulate the impact, if any, that it will have on your algorithm.

For example, assume that you are provided with the following memory requirements for the secondary algorithm:

• Throughput: 650 MB/s

Given that we know that at any one time the primary algorithm, plus the HDMI output frame buffer, is consuming  $\sim 995$  MB/s of the memory bandwidth, and our total available memory bandwidth is 1600 MB/s, we know that with the total bandwidth requirement for our system exceeds the total available bandwidth by  $\sim 50$  MB/s.

To enable the modeling of the secondary algorithm memory access, uncomment the Contention Memory Traffic Generator block. The block mask settings are shown below.



Simulating the system with the secondary algorithm's memory accesses, results in the following Memory Bandwidth Usage plot.



As you can see, the combined required memory bandwidth exceeds the available bandwidth at around 0.03s - when the secondary algorithm begins memory access requests, resulting in the other masters not achieving their required throughput. Looking at the logic analyzer waveform, we can see this manifested as dropped buffers for the Frame Buffer write master, meaning that the input video frame will not be written to memory.



## Implement and Run on Hardware

Following products are required for this section:

- HDL Coder<sup>™</sup>
- "SoC Blockset Supported Hardware" on page 6-2

To implement the model on a supported SoC board use the SoC Builder application. Open the mask of 'FPGA' subsystem and set model variant to 'Pixel based processing'.

Comment out 'HDMI Buffer Write', 'HDMI Buffer Read' and 'Contention' blocks.

Click, 'Configure, Build, & Deploy' button in the toolstrip to open SoC Builder

- Select 'Build Model' on 'Setup' screen. Click 'Next'.
- Click 'View/Edit Memory Map' to view the memory map on 'Review Memory Map' screen. Click 'Next'.
- Specify project folder on 'Select Project Folder' screen. Click 'Next'.
- Select 'Build, load and run' on 'Select Build Action' screen. Click 'Next'.
- Click 'Validate' to check the compatibility of model for implementation on 'Validate Model' screen, Click 'Next'.
- Click 'Build' to begin building of the model on 'Build Model' screen. An external shell will open when FPGA synthesis begins. Click 'Next'.
- Click 'Next' to 'Load Bitstream' screen.

The FPGA synthesis may take more than 30 minutes to complete. To save time, you may want to use the provided pre-generated bitstream by following these steps:

- Close the external shell to terminate synthesis.
- Copy pre-generated bitstream to your project folder by running the command below and then,
- Click 'Load and Run' button to load pre-generated bitstream and run the model on SoC board

```
copyfile(fullfile(matlabroot, 'toolbox', 'soc', 'socexamples', 'bitstreams', 'soc_histogram
```

To run the model, execute the following aximaster test bench for soc\_histogram\_equalization\_top\_aximaster.

The following figure shows the Memory Bandwidth usage when the application is deployed on hardware.



# **Summary**

You designed a video application with real time HDMI I/O and frame buffering in external memory. You explored effects of other consumers of memory on overall bandwidth. You used SoC Builder to implement the model on hardware and verify the design.

# Streaming Data from Hardware to Software

This example presents a systematic approach to design the data-path between hardware logic (FPGA) and embedded processor using SoC Blockset. Applications are often partitioned between hardware logic and embedded processor on a system-on-chip (SoC) device to meet throughput, latency and processing requirements. You will design and simulate the entire application comprising of FPGA & processor algorithms, memory interface and task scheduling to meet the system requirements. You will then validate the design on hardware by generating code from the model and implementing on a SoC device.

## Supported hardware platforms:

- Xilinx® Zynq® ZC706 evaluation kit
- Xilinx Zynq UltraScale™ + MPSoC ZCU102 Evaluation Kit
- ZedBoard<sup>™</sup> Zynq-7000 Development Board
- Altera® Cyclone® V SoC development kit
- Altera Arria® 10 SoC development kit

# **Design Task and System Requirements**

Consider an application that continuously process data on the FPGA and the embedded processor. In this example, the FPGA algorithm filters the input signal and streams the resulting data to the processor. In the implementation model soc\_hwsw\_stream\_implementation, the *Buffer* block represents the transfer of data from FPGA to processor. The processor operates on the buffered data and classifies the data as either high or low frequency in the *Processor Algorithm* subsystem. FPGA generates a test data of either low or high frequency sinusoid based on the DIP switch setting in *Test Data Source* subsystem.



Copyright 2019 The MathWorks, Inc.

The application has following performance requirements:

Throughput: 10e6 samples per second

Maximum latency: 100ms

• Samples dropped: < 1 in 10000

# **Challenges in Designing Datapath**

The FPGA processes data sample by sample while the processor operates on a frame of data at a time. The data is transferred asynchronously between FPGA and processor, and the duration of software task can vary for each execution. Therefore, a queue is needed to hold the data between FPGA and processor to prevent data loss. This queue is implemented in two stages, one as a FIFO of bursts of data samples in FPGA memory and other as a series of frame buffers in external memory. You will need to set three parameters related to the queue: frame size (number of samples in a frame of data), number of frame buffers and FIFO size (number of bursts of samples in FIFO).



Figure 1

These design parameters affect performance and resource utilization. Increasing the frame size allows more time for software task execution and to meet throughput requirements at the cost of increasing latency. Typically, you set these parameters only when you are ready to implement on hardware, which presents the following challenges:

- It is difficult to debug issues like dropping of samples in hardware due to lack of visibility.
- It is difficult to design your application efficiently without first evaluating the effects of hardware interfaces. It can take many design-implementation iterations as you can assess performance only via implementation on hardware.
- It is difficult to optimize design since performance and cause-effect relationships are difficult to determine through implementation.

Ideally you want to account for these hardware effects while you are developing the application at design time, before implementing and running on hardware. One way to satisfy these requirements is to simulate the hardware effects, at design time. If you can simulate the variation in task durations, utilization of memory buffers/FIFOs and external memory transfer latencies, you can evaluate their effects on application design and implement the proven design on hardware. SoC Blockset allows you to simulate these effects so you can evaluate the performance of the deployed application before running on hardware.

# **Design Using SoC Blockset**

Create an SoC model soc\_hwsw\_stream\_top from the implementation model soc\_hwsw\_stream\_implementation using the "Stream from FPGA to Processor Template" on page 1-49. The top model includes FPGA model soc hwsw stream fpga and processor

model soc\_hwsw\_stream\_proc instantiated as model references. The top model also includes Memory Channel and Memory Controller blocks which model shared external memory between FPGA and processor. These were earlier modeled using buffer block in the implementation model. To improve simulation performance, FPGA algorithm is also modeled for Frame-based processing soc\_hwsw\_stream\_fpga\_frame and is included as model variant subsystem at the top level. You can select the model to run in Frame-based or Sample-based processing by selecting from the mask of FPGA subsystem.

#### Streaming Data from Hardware to Software



**Design to Meet Latency Requirement**: Latency in the datapath from FPGA to processor comprises of the latency through the FPGA logic and the time for data transfer from FPGA to processor through memory channel. In this example, the FPGA clock is 10MHz and the latency is on the order of nanoseconds. This is negligible in comparison with latency within the memory channel, which is on the order of milliseconds. Therefore, let us focus on designing for latency for data transfer in the following manner.

Begin with a few potential frame sizes and calculate Frame period for each frame size in Table -1. Frame period is the time between two consecutive frames from FPGA to processor. For this example, FPGA output sample time is 10e-6 as a valid data is output every 100 clock cycles from the FPGA.

FramePeriod = Framesize \* FPGAOutputSampleTime

Latency of the memory channel is due to time elapsed by samples in the queue of frame buffers and FPGA FIFO. Let us size FPGA FIFO equivalent to one frame buffer. To stay within the maximum latency requirement, calculate the number of frame buffers for each frame size as per:

$$(NumFrameBuffers + 1) * FramePeriod <= MaxLatency$$

Maximum latency allowed for this example is 100 ms. Since the number of buffers account for maximum latency requirement, for all the cases in Table -1, latency requirement is met. A minimum of 3 frame buffers is needed in external memory for data transfer. While one of the frame buffers is written by FPGA, the other frame buffer is read by processor. Therefore, Case #8-10 from the table below are rejected as they violate the minimum buffer requirement.

| #  | Frame Size | Frame<br>period<br>(ms) | Number<br>of buffers | Meets or Violates requirements |
|----|------------|-------------------------|----------------------|--------------------------------|
| 1  | 5          | 0.05                    | 1999                 |                                |
| 2  | 100        | 1                       | 99                   |                                |
| 3  | 800        | 8                       | 11                   |                                |
| 4  | 1000       | 10                      | 9                    |                                |
| 5  | 1600       | 16                      | 5                    |                                |
| 6  | 2000       | 20                      | 4                    |                                |
| 7  | 2400       | 24                      | 3                    |                                |
| 8  | 8000       | 80                      | <1                   | Violates min buffers req       |
| 9  | 18000      | 180                     | <1                   | Violates min buffers req       |
| 10 | 30000      | 300                     | <1                   | Violates min buffers req       |

Table -1

To visualize the latency, simulate the model and open  $Memory\ Channel\ block$ , go to  $Performance\ tab$  and click on  $View\ performance\ plots$ . Select all the latency options under  $Plot\ Controls$  and click  $Create\ Plot$ . As captured in Figure - 2, you will notice that the composite latency meets the < 100 ms requirement.



Figure - 2

**Design to Meet Throughput Requirement**: On average, the software tasks processing must complete within a frame period, as otherwise, task will overrun leading to dropping of data and violate the throughput requirement. i.e.

#### FramePeriod > MeanTaskDuration

There are various ways of obtaining mean tasks durations corresponding to frame sizes for your algorithm, which are covered in "Task Execution" on page 5-77 Example. Mean task durations for various frame sizes are captured in Table 2.

| #  | Frame Size | Frame<br>period | Number of<br>buffers | Mean Task<br>Duration | Meets or Violates requirements |
|----|------------|-----------------|----------------------|-----------------------|--------------------------------|
|    |            | (ms)            |                      | (ms)                  | •                              |
| 1  | 5          | 0.05            | 1999                 | 0.059                 | Violates throughput            |
| 2  | 100        | 1               | 99                   | 1.06                  | Violates throughput            |
| 3  | 800        | 8               | 11                   | 7.858                 |                                |
| 4  | 1000       | 10              | 9                    | 9.61                  |                                |
| 5  | 1600       | 16              | 5                    | 15.3                  |                                |
| 6  | 2000       | 20              | 4                    | 19.067                |                                |
| 7  | 2400       | 24              | 3                    | 22.812                |                                |
| 8  | 8000       | 80              | <1                   | 76.56                 | Violates min buffers req       |
| 9  | 18000      | 180             | <1                   | 175.23                | Violates min buffers req       |
| 10 | 30000      | 300             | <1                   | 289.52                | Violates min buffers req       |

Table -2

To simulate the model with the parameters corresponding to rows (#2-#7) in the table use the function set\_hwsw\_stream\_set\_parameters function with row # as an argument. Set the model parameters for row # 2 as below:

```
soc_hwsw_stream_set_parameters(2); % row # 2
```

Since the Mean Task Duration of 1.06 ms is more than the Frame Period of 1.0 ms, the data is dropped in the memory channel. Open Logic Analyzer and notice that signal *icFIFODroppedCount* is increasing throughout the simulation as captured in Figure 3, indicating accumulating amount of dropped data.



Since data is dropped during transfer from FPGA to processor through memory, this is reflected as a drop in throughput. Open  $Memory\ controller$  block, go to  $Performance\ tab$  and click on  $Plot\ data\ throughput\ button\ under\ Performance\ tab\ to\ see\ the\ memory\ throughput\ plot\ as\ in\ figure\ 4.$  Note that the throughput is less than the required 0.4 MBps. Since the FPGA output data sample time is 10e-6 and each sample is 4 bytes wide, the required streaming throughput for the system is 4 bytes/10e-6 = 400 KBps.



Figure - 4

**Design to Meet Drop Samples Requirement**: Since the task durations can vary for many reasons like different code execution paths and variation in OS switching time, it is possible that data is dropped in the memory channel. Specify the mean task execution duration and statistical distribution for task durations in the mask of Task Manager block.

Size the FIFO equivalent to one frame buffer. Set the FIFO burst size to 16 Bytes and calculate the FIFO depth:

```
FIFO_depth = FrameSize/FIFOBurstSize
```

Now, simulate the model for 100 sec (10e6 samples at 10e-6 samples per second) for cases # 3-7. Open the Logic analyzer and note the number of samples dropped on signal *icFIFODroppedCount*.

```
soc_hwsw_stream_set_parameters(3); % set the model parameters for #3
```

Open Simulation Data Inspector and add signals from memory channel as shown in Figure 5 below. Note that as buffers usage (signal *buffAvail*) increase to the maximum 11, the FIFO usage (signal *isFIFOEntries*) begin to increase. When FIFO is completely used, the samples get dropped (signal *isFIFODroppedCount*)



The results of simulation for all the cases #3-7 and resultant sample dropped per 10000 are tabulated in Table 3.

| #  | Frame<br>Size | Frame<br>period | Number<br>of buffers | Mean Task<br>Duration | Avg Samples<br>dropped per | Meets or Violates<br>requirements |
|----|---------------|-----------------|----------------------|-----------------------|----------------------------|-----------------------------------|
|    |               | (ms)            |                      | (ms)                  | 10000                      |                                   |
| 1  | 5             | 0.05            | 1999                 | 0.059                 |                            | Violates throughput               |
| 2  | 100           | 1               | 99                   | 1.06                  |                            | Violates throughput               |
| 3  | 800           | 8               | 11                   | 7.858                 | 172.6                      | Violates drop samples             |
| 4  | 1000          | 10              | 9                    | 9.61                  | 0                          | Meets all requirements            |
| 5  | 1600          | 16              | 5                    | 15.3                  | 1                          | Meets all requirements            |
| 6  | 2000          | 20              | 4                    | 19.067                | 2.25                       | Violates drop samples             |
| 7  | 2400          | 24              | 3                    | 22.812                | 3.9                        | Violates drop samples             |
| 8  | 8000          | 80              | <1                   | 76.56                 |                            | Violates min buffers req          |
| 9  | 18000         | 180             | <1                   | 175.23                |                            | Violates min buffers req          |
| 10 | 30000         | 300             | <1                   | 289.52                |                            | Violates min buffers req          |

Table - 3

The highlighted entries (rows #4 and #5) are valid design choices since they meet throughput, latency and drop samples requirement.

# Implement and Run on Hardware

Following products are required for this section:

- HDL Coder<sup>™</sup>
- Embedded Coder®
- SoC Blockset Support Package for Xilinx Devices, or
- SoC Blockset Support Package for Intel Devices

For more information about support packages, see "SoC Blockset Supported Hardware" on page 6-2

To implement the model on a supported SoC board use the SoC Builder tool. Open the mask of 'FPGA' subsystem and select model variant to 'Sample based processing'. By default, the model will be implemented on **Xilinx® Zynq® ZC706 evaluation kit** as it is

configured with that board. To open SoC Builder click, 'Configure, Build, & Deploy' button in the toolstrip and follow these steps:

- Select 'Build Model' on 'Setup' screen. Click 'Next'.
- Click 'View/Edit Memory Map' to view the memory map on 'Review Memory Map' screen. Click 'Next'.
- Specify project folder on 'Select Project Folder' screen. Click 'Next'.
- · Select 'Build, load and run' on 'Select Build Action' screen. Click 'Next'.
- Click 'Validate' to check the compatibility of model for implementation on 'Validate Model' screen. Click 'Next'.
- Click 'Build' to begin building of the model on 'Build Model' screen. An external shell will open when FPGA synthesis begins. Click 'Next'.
- Click 'Test Connection' on 'Connect Hardware' screen to test the connectivity of host computer with SoC board. Click 'Next' to go to 'Run Application' screen.

The FPGA synthesis may take more than 30 minutes to complete. To save time, you may want to use the provided pre-generated bitstream by following these steps:

- Close the external shell to terminate synthesis.
- Copy pre-generated bitstream to your project folder by running the command below and then,
- Click 'Load and Run' button to load pre-generated bitstream and run the model on SoC board

```
copyfile(fullfile(matlabroot, 'toolbox', 'soc', 'socexamples', 'bitstreams', 'soc_hwsw_streams')
```

While the application is running on hardware, toggle the DIP switch on your board to change the test data from 'low' to 'high' frequency and notice the blinking of corresponding LED on the board. You can also read the samples dropped count in the model running on external mode. Thus, you verify that your implementation from SoC Blockset model matches the simulation and meets the requirements.

**Implementation on other boards**: To implement the model on a supported board other than Xilinx® Zynq® ZC706 evaluation kit board, you must first configure the model to the supported board and set the example parameters as below. Open **Model Configuration Parameters**, navigate to **Hardware Implementation** tab and perform the following settings:

• Select your board from the drop-down list under **Hardware board** on both top and processor model.

- Navigate to Target hardware resources > FPGA design (top level) tab and enable Include MATLAB as AXI Master IP for host-based interaction and set IP core clock frequency (MHz) to 10 MHz.
- Navigate to Target hardware resources > FPGA design (debug) tab and enable Include AXI Interconnect monitor.

Next, open SoC Builder and follow the steps as previously stated for Xilinx® Zynq® ZC706 above. Modify the *copyfile* command to match the bitstream corresponding to your board. Available pre-generated bitstreams are:

- 'soc hwsw stream top fpga-zc706.bit'
- 'soc hwsw stream top fpga-zedboard.bit'
- 'soc\_hwsw\_stream\_top\_fpga-zcu102.bit'
- · 'soc hwsw stream top fpga-c5soc.sof'
- 'soc hwsw stream top fpga-a10soc.sof'

In summary, this example showed you a systematic approach to design the datapath between hardware logic and embedded processor using SoC Blockset. You chose design parameters of frame size, number of frame buffers and FIFO size to meet the system performance requirements of throughput, latency and drop samples. By simulating and visualizing the effects of these parameters on the complete model containing hardware logic, processor algorithms, external memory and processor task durations, you uncovered issues like loss of throughput, latency and dropping of samples before implementing on hardware. This workflow ensures that the design works on hardware before implementation and avoids long design-implementation iterations.

# **Analyze Memory Bandwidth Using Traffic Generators**

This example demonstrates how to analyze memory bandwidth for an SoC application. In memory-intensive hardware designs, you may have multiple masters accessing a common DDR memory. In such cases, it is important to analyze the dynamic requirement of all memory masters to guide algorithm design and hardware board requirement for deployment. You can simulate the memory traffic using Memory traffic generators, analyze the bandwidth usage and verify it on the hardware.

## Supported hardware platforms

- Xilinx® Zynq® ZC706 evaluation kit
- Xilinx® Kintex® 7 KC705 development board

## **Design Task**

Consider an application performing HD video processing in FPGA on real-time input and output. This application requires four memory consumers vying for DDR access simultaneously. Memory master 1 writes incoming video frames to memory and Memory master 4 reads video frames out of memory and connect to output display. Memory master 2 reads the data from memory for processing in FPGA and Memory master 3 writes the data back to memory.



Each master operates on HD video with following characteristics:

• Frame size: 1920x1080p

Pixel size: 2 Bytes (YCbCr format)

• Frame period: 1/60 = 16.67ms (for 60 FPS)

• Frame data: 1920x1080x2 = 4.1472MB

Each master requires following minimum memory bandwidth to get the frame rate of 60 FPS.

• Memory bandwidth: Frame data/ Frame period = 4.1472e6/16.67e-3 = 248.8MBps

Assume the memory controller characteristics are as follows:

• Clock frequency: 200 MHz

• Data width: 32 bits

• Burst transaction length: 128

### **Design Using SoC Blockset**

Create a model using Memory Controller and Memory Traffic Generator blocks to model four memory masters.

Memory Controller: Set the memory controller parameters in Configuration
Parameters > Hardware Implementation > Target Hardware Resources. Under
FPGA Design (mem Controllers) tab, set the clock frequency to 200 MHz and data
width to 32. Under FPGA Design (debug) tab, select Include AXI interconnect
monitor.

# Analyze Memory Bandwidth for SoC Design Using Traffic Generators



Copyright 2019 The MathWorks, Inc.

**Memory Traffic Generators 1 & 4:** Memory traffic characteristics for Master 1 and 4 are same as they represent streaming of video frames to and from memory. Set the memory traffic characteristics for masters 1 and 4 as follows:

- Burst size (in bytes): Burst transaction length \* (Data width/8) = 128\* 32/8 = 512
- Total burst requests: 4 frames data for simulation = 4 \* Traffic data/Burst size = 4\*8100 = 32400

Burst inter access time: Frame period/Number of Burst requests = 16.67e-3/8100 = 20.58e-7 sec. As a constant data traffic, the data is continuously received at a constant rate. Set the burst times as below:

- First burst time = 20.58e-7
- Random time between the bursts = [20.58e-7 20.58e-7]

Update the Memory Traffic Generator1 and Memory Traffic Generator4 block mask with above values. Set the request type for Memory Traffic Generator1 with writer and Memory Traffic Generator4 with reader. Clear the **Wait for burst done** option in both the block masks as these masters represent the masters with continuous traffic, such as HDMI Camera and display.



**Memory Traffic Generators 2 & 3:** Memory Traffic Generator2 represent reader for FPGA Algorithm and Memory Traffic Generator3 represent writer from FPGA Algorithm. Set the memory traffic characteristics for masters 2 and 3 as follows:

- Burst size (in bytes): Burst transaction length \* (Data width/8) = 128\* 32/8 = 512
- Total burst requests: 4 \* Traffic data/Burst size = 4\*8100 = 32400(4 frames data for simulation)

Burst inter access time: (Burst Length + 10)/Clock period = 6.9e-7(0.69us). To allow some randomness in burst times for read and write request of data, due to variation in demands of algorithm, set the burst times as below:

- First burst time: 7.2e-7
- Random time between the bursts: [7.2e-7 7.4e-7]



#### **Simulate**

Run the model. After completion of simulation, open the Memory Controller block and click on **View performance plots** under **Performance** tab. Select all the masters under **Bandwidth** tab and click **Create Plot**. You can notice that all masters roughly achieved a bandwidth of 190 MBps and did not meet the required 248 MBps. It is also observed by the warnings in the diagnostic viewer.



To meet the required bandwidth, modify the data width of controller from 32 to 64 in configuration parameter settings under **Target Hardware Resources**. This requires changing the Memory Traffic Generator settings accordingly as follows:

• Burst size (in bytes): Burst transaction length \* (Data width/8) = 128\* 64/8 = 1024

• Total burst requests: 4 \* Traffic data/Burst size = 4\*4050 = 16200(4 frames of data for simulation)

Burst inter access time for Memory Traffic Generators 1 & 4: Frame Period/Number of Burst requests = 16.67e-3/4050 = 41.16e-7 sec. Set the burst times as below:

- First burst time: 41.16e-7
- Random time between the bursts: [41.16e-7 41.16e-7]

There is no change in First burst time and Random time between the bursts for Memory Traffic Generators 2 and 3, since they are determined based on algorithm requirements.

The new parameter settings are as follows:



Simulate the model and open the **Bandwidth** plot from Memory Controller as mentioned earlier. Notice that Memory bandwidth achieved by Memory Traffic Generator 1 and 4 is 248 MBps. The memory bandwidth from Generator 2 and 3 is around 500 MBps. This meets the design requirement as all the masters are able to meet the real-time requirement of 248 MHz. Observe that there are no warnings on the diagnostic viewer as burst requests are not dropped.



### Implement and Run on Hardware

"SoC Blockset Supported Hardware" on page 6-2 is required for this section.

To implement the model on a supported FPGA board, use the SoC Builder application. By default, the model will be implemented on **Xilinx® Zynq® ZC706 evaluation kit** as it is configured with that board.

AXI Traffic Generator(ATG), the hardware IP Core for Memory Traffic Generator block does not support random burst inter access times and it differentiates Reader and Writer masters in arbitration policy unlike the Memory Traffic Generator block for simulation. Therefore, before implementing on hardware, modify the Memory block settings as follows:

- Make all the Memory Traffic Generators as 'Writers'
- For Memory Traffic Generator 2 and 3, set [7.2e-7 7.2 e-7] for Random time between burst to make it fixed inter burst time of 7.2e-7

Open SoC Builder from the Tools menu and follow these steps:

- Select Build Model on Setup screen. Click Next.
- Click View/Edit Memory Map to view the memory map on Review Memory Map screen. Click Next.
- Specify project folder on **Select Project Folder** screen. Click **Next**.
- Select Build, load and run on Select Build Action screen. Click Next.
- Click Validate to check the compatibility of model for implementation on Validate Model screen. Click Next.
- Click Build to begin building of the model on Build Model screen. An external shell will open when FPGA synthesis begins. Click Next to Load Bitstream screen.

The FPGA synthesis may take more than 30 minutes to complete. To save time, you may want to use the provided pre-generated bitstream by following steps:

- Close the external shell to terminate synthesis.
- Copy the pre-generated bitstream to your project folder and rename by running the below command.
- >> copyfile(fullfile(matlabroot, 'toolbox', 'soc', 'socexamples', 'bitstreams', 'soc\_memory
- Click **Load** button to load pre-generated bitstream.

To run this example, copy the example test bench to your project folder.

```
copyfile(fullfile(matlabroot, 'toolbox', 'soc', 'socexamples', 'soc_memory_traffic_generate
```

The testbench configures the generated hardware ATG IP cores for Memory Traffic Generators. To run on hardware, increase the number of burst requests by 100 times since it uses <code>MATLAB®</code> as <code>AXI</code> <code>Master</code> IP to get the samples back to <code>MATLAB®</code>, which involves substantial delay in accessing hardware. Load soc\_memory\_traffic\_generator\_zc706\_aximaster.mat file and increase the number of burst requests for all the masters in ATG configuration to 100 times. Save the .mat file requests in ATG configuration.

Enter the following command to run the test bench soc\_memory\_traffic\_generator\_aximaster.

soc\_memory\_traffic\_generator\_aximaster

After running the test bench, the following output is generated showing the memory traffic. All masters passing the bandwidth requirements.



Implementation on Xilinx® Kintex® 7 KC705 development board: To implement the model on KC705 development board, you must first configure the model to Xilinx® Kintex® 7 KC705 development board and set the following example parameters. Open **Model Configuration Parameters**, navigate to **Hardware Implementation** tab and perform the following:

 Select Xilinx® Kintex® 7 KC705 development board from the drop-down list under Hardware board.

- Navigate to Target hardware resources > FPGA design (top level) tab and enable Include MATLAB as AXI Master IP for host-based interaction.
- Navigate to Target hardware resources > FPGA design (mem controllers) tab and set Controller data width (bits) to 64.
- Navigate to Target hardware resources > FPGA design (debug) tab and enable Include AXI interconnect monitor.

Next, open SoC Builder and follow the steps as previously stated for Xilinx® Zynq® ZC706 above. Modify the copyfile command to match Kintex® 7 KC705 development board bitstream as below.

>> copyfile(fullfile(matlabroot, 'toolbox', 'soc', 'socexamples', 'bitstreams', 'soc\_memory

In summary, you simulated the memory traffic for a prospective design before designing the algorithms. You analyzed memory bandwidth and modified memory parameters to meet the design requirement. You verified the results on hardware.

# Record I/O Data from SoC Device

This example shows you how to record real-world data from hardware for use in simulation.

Supported hardware platforms:

- Xilinx® Zyng® ZC706 evaluation kit
- Xilinx Zyng UltraScale™ + MPSoC ZCU102 Evaluation Kit
- ZedBoard™ Zynq-7000 Development Board
- Altera® Cyclone® V SoC development kit
- Altera Arria® 10 SoC development kit

In many situations you may want to verify your algorithm against real-world data. This example, using the Streaming Data from Hardware to Software model, shows how to record signals from the AXI4 interface on a SoC device. This workflow allows you to focus on the processor side of the algorithm by substituting a pre-recorded data stream in place of the Simulink® FPGA design.

We recommend completing "Streaming Data from Hardware to Software" on page 5-39 example.

# 

#### Streaming Data from Hardware to Software

#### **Record Data from FPGA**

In this section, you will record data generated by the FPGA subsystem in the Streaming Data from Hardware to Software model. In this model, the FPGA subsystem generates a sinusoidal signal with frequency 1kHz or 10kHz, controlled via a DIP switch (DS1). The FPGA algorithm filters the signal and sends it to the processor through AXI4 Stream Memory Channel.



Following products are required for this section:

SoC Blockset Support Package for Xilinx® Devices

Follow the steps below to record data from FPGA:

```
hw = socHardwareBoard('Xilinx Zyng ZC706 evaluation kit', 'hostname', '10.10.10.15', 'use
```

Enter the appropriate hardware board name, IP address and the user credentials in the command above. The hardware object hw, is a communication gateway that provides control commands and I/O exchange.

**2**. Open Streaming Data from Hardware to Software model. Load the provided pregenerated FPGA bitstream for this model to hardware.

```
socLoadExampleBitstream(hw,'soc_hwsw_stream_top')
```

**3.** Create a data recorder for your hardware board.

```
dr = soc.recorder(hw);
```

**4.** Create an **AXI Stream Read** input source object and configure the source properties.

```
src = soc.iosource(hw,'AXI Stream Read');
src.devName = 'mwfpga_algorithm_wrapper_ip0:s2mm0';
samplingFrequency = 1e5;
src.dataTypeStr = 'uint32';
src.SamplesPerFrame = 1000;
src.SampleTime = src.SamplesPerFrame/samplingFrequency;
```

The samplingFrequency represents the sine wave sampling rate in the Streaming Data from Hardware to Software model.

**5.** Add the **AXI Stream Read** source to the data recording session.

```
addSource(dr,src,'AXI4 stream interface')
```

**6.** Initialize the I/O sources on the hardware board for recording.

```
setup(dr)
```

7. Use the record function to record 10 seconds of data.

```
record(dr, 10)
while isRecording(dr)
    pause (0.1);
end
```

During the recording, toggle the DIP switch (DS1) to change the frequency of signal generated by the FPGA.

**8.** Save the recorded data to a file:

```
save(dr,'sine_wave_data')
```

# **Record RF Signals**

In this section, you will capture RF signals from an AD - FMCOMMS2/3/4 RF card connected to the FPGA. The data will be streamed from the RF card to the processor using AXI4 stream interface.

Following products are required for this section:

• SoC Blockset Support Package for Xilinx® Devices

Supported hardware platforms for this section are:

- Xilinx® Zyng® ZC706 evaluation kit
- ZedBoard<sup>™</sup> Zynq-7000 Development Board

To configure RF card refer to "Manual Host-Radio Hardware Setup" (Communications Toolbox Support Package for Xilinx Zyng-Based Radio)

# RF Capture



Copyright 2019 The MathWorks, Inc.

**1**. Open RF Capture model. Load the provided pre-generated FPGA bitstream for this model to hardware.

```
socLoadExampleBitstream(hw,'soc_rfcapture')
```

2. Configure radio card.

```
rf = rfcard(hw);
rf.CenterFrequency = 1090e6;
rf.GainSource = 'AGC Fast Attack';
rf.BasebandSampleRate = 4e6;
```

```
rf.ShowAdvancedProperties = true;
rf.ShowInternalProperties = true;
rf.BISTToneMode = 'Tone Inject Rx';
rf();
3. Setup data recorder.
dr = soc.recorder(hw);
src = soc.iosource(hw,'AXI Stream Read');
src.devName = 'mwfpga_data_capture_ip0:s2mm0';
src.dataTypeStr = 'uint32';
src.SamplesPerFrame = 4000;
src.SampleTime = src.SamplesPerFrame/rf.BasebandSampleRate;
addSource(dr,src,'AXI4 stream interface');
4. Record radio signals.
setup(dr)
system(hw,'devmem 0x40010100 32 1');
record(dr,1)
while isRecording(dr)
     pause (0.1);
end
save(dr,'zynq_rf_data')
```

**5.** To playback the recorded RF data, open RF Playback model. Enter the dataset name and the source name on the **IO Data Source** block and simulate the model.



A pre-recorded dataset file  $zynq\_rf\_data.tgz$  is available at  $matlab \setminus boolbox \setminus socexamples$ .

#### See Also

"Simulate with I/O Data Recorded from SoC Device" on page 5-74

# Simulate with I/O Data Recorded from SoC Device

This example shows you how to use recorded real-world data in simulation.

Supported hardware platforms:

- Xilinx® Zynq® ZC706 evaluation kit
- Xilinx Zynq UltraScale™ + MPSoC ZCU102 Evaluation Kit
- Altera® Cyclone® V SoC development kit
- Altera Arria® 10 SoC development kit

In many situations you may want to verify your algorithm against real-world data. This example shows how to use the recorded data signal in a simulation of the generated processor system model of the complete SoC application.

We recommend completing "Streaming Data from Hardware to Software" on page 5-39 example.

#### **Use Recorded Data in Simulation**

In this section, you will simulate the processor subsystem of the SoC application model with recorded data as input. The processor subsystem of the SoC application uses AXI4 protocol to stream data from external memory and determine if the signal contained in the data is either high or low frequency. An **IO Data Source** block replaces the external memory and the FPGA subsystem of the model with playback of the AXI4 stream data. You will use data recorded in "Record I/O Data from SoC Device" on page 5-67 example.

# Signal Detection



Copyright 2019 The MathWorks, Inc.

- 1. Open Signal Detection model.
- 2. Open IO Data Source block mask.
- **3.** Click **Browse...** and select the **matlab\toolbox\soc\socexamples\zynq\_sine\_data.tgz** file containing recorded data.
- **4.** Click **Select...** and choose the data source within the data file to playback. Click **OK** to close the block mask dialog.
- **5.** Run the Simulink® model and open **Vector Scope** to observe the recorded data.
- ${f 6.}$  To access the recorded data in MATLAB®, use socFileReader.

```
h = socFileReader('zynq_sine_data.tgz');
data = getData(h,'AXI4 stream interface');
```

The returned data is a time series object of 'uint32'. To plot the data in MATLAB convert 'uint32' to 'int32'.

```
plot(data.Time, typecast(data.Data,'int32'));
```

# See Also

"Record I/O Data from SoC Device" on page 5-67

# **Task Execution**

This example shows how to simulate task execution and how to generate code and run it on an SoC hardware board.

Application development often includes simulating an algorithm to ensure the correct behavior. Such simulations usually ignore the real-time aspects of an embedded system environment. This may allow certain timing problems to remain undiscovered until the application runs on hardware.

The timing problems often lead to incorrect application behavior. SoC Blockset helps you detect these problems in simulation rather than on hardware. This can help you avoid costly debugging on hardware.

Timing problems are more likely to occur as applications become more complex. For example, rate overruns and undesired rate preemption are more frequent in applications with multiple tasks due to resource constraints and task dependencies. Simulating multitasking applications with SoC Blockset will help you in detecting these problems early.

In this example, task execution is simulated using SoC Blockset. You will learn about different techniques for simulating task duration and when to use them. You will also learn how to verify the timing specifications on hardware.

Supported hardware platforms:

- Xilinx® Zynq® ZC706 evaluation kit
- Xilinx Zynq UltraScale™ + MPSoC ZCU102 Evaluation Kit
- ZedBoard™ Zynq-7000 Development Board
- Altera® Cyclone® V SoC development kit
- Altera Arria® 10 SoC development kit

# Task Execution Case 1



Copyright 2019 The MathWorks, Inc.

#### Introduction

SoC Blockset simulates the execution of software tasks as they would execute on an SoC processor. The simulation honors the parameters of the task, such as period, priority and processor core. SoC Blockset simulates task preemption, task overruns, and concurrent task execution.

The following diagram illustrates the above-mentioned task execution simulation aspects. In the first two subplots, you can observe that Task1 executes every 0.1 s and, since they both share Core 0, Task1 preempts Task2 that executes every 0.2 s. In the third subplot, you can observe that Core 0 still has some idle time. The last two subplots show Task3 running every 0.3 s on Core 1.



To learn more about simulating task execution, see "What is Task Execution?" on page 1-2

The Task Manager block allows you to configure execution of the tasks in your model. In the block dialog, you define how many tasks you need in your system using **Add** and **Delete** buttons. On the **Main** tab of the dialog, you set the main task properties, while on the **Simulation** tab you set the simulation task properties.

The following figure illustrates the **Main** tab of the **Task Manager** block.



A task has a name so that it can be identified in the model and the various associated plots. Port labels on the **Task Manager** block use the task names for easy identification.

A task can be of two types. An event-driven task executes when triggered by an event. An event line from an IO data source block connected to the **Task Manager** block triggers the task. A timer-driven task executes with a defined period as defined in the **Main** tab of the **Task Manager**.

You define the priority of event-driven tasks in the **Main** tab of the **Task Manager**. Timer-driven task priority is assigned automatically.

In the **Task Manager** dialog you may also set the processor core on which to execute a task so that, if your hardware board has multiple cores, you may set the tasks to execute concurrently.

The **Task Manager** block also allows you to configure how task overruns are handled. For example, you may decide to drop an instance of a task if the previous task instance has not started or completed. Or, you may decide to try to catch up with the task schedule despite overruns.

To simulate real-time task effects, such as preemption and overruns, SoC Blockset requires you to provide the duration of each task. The duration is defined as the time elapsed between the task start and the task end. Ideally, you will measure the task duration on your hardware board. If that is not possible, look up the task duration in the data sheets provided by the task algorithm developers. As a last resort, you should set the duration relative to the task period or the shortest recurrence interval for aperiodic tasks.

SoC Blockset has several choices for setting the task duration. As the task duration is applied only to simulation, these choices are found in the **Simulation** tab of the **Task Manager** dialog.

The following figure illustrates the **Simulation** tab of the **Task Manager** dialog.



The most commonly used options are:

- **Dialog** Allows you to specify task duration via a normal distribution, or a combination of multiple normal distributions, using the mean and the standard deviation parameters.
- Input port Allows you to specify task duration on an instance basis. For example, you may create a model that calculates task duration and connect it to the Task Manager input port.

The following flowchart will guide you in selecting the most appropriate option.



If the duration times for your task have different distributions and causes, select the most fitting options using the flowchart as general guidance.

You can configure additional simulation and execution parameters for SoC Blockset in the model configuration dialog. Task profiling, in simulation and on processor, allows you to profile task execution, stream results to Simulation Data Inspector and save them into a file.

You can also set the kernel latency value to affect task execution in simulation. This value varies a lot but is typically much smaller than task duration. Therefore, we recommend you leave the value set to 0 s unless you can deterministically find the appropriate value for your hardware board.

The following figure shows SoC parameters related to task execution in the model configuration parameters dialog. Note that the **Task profiling on processor** panel shows only if you install all required products and hardware support packages.



The remaining steps of this example will illustrate some of the options shown in the above flowchart.

#### Simulating an Algorithm with Single Code Path

This case requires you to simulate a DSP algorithm that processes a frame of data. The following product is required for that:

• DSP System Toolbox

If you do not have this product, proceed to the next case after reviewing the description of this case.

In this case, you will learn how to model the task duration when the task algorithm has a single code path.

Assume that you are tasked with developing an application that processes RF (radio frequency) data on an SoC board. After being preprocessed in the FPGA core, the data is streamed to the processor core using the AXI4 protocol. The algorithm running on the processor core should determine whether the data contains a high-frequency or a low-frequency signal. To that end, a low-pass and a high-pass filter are applied to the data. The resulting signals are then compared to a selected threshold. Based on this description, this task has a single code path, with no major code branches. The source code for the task function might have the following form.

```
double dataReadTask(double in[])
{
    /* Frame size is always 1000 */
    int signalType; /* 0 - LP, 1 - HP */
    double out1[1000], out2[1000];
    filterLP(in, out1, 1000);
    filterHP(in, out2, 1000);
    signalType = thresholding(out1, out2, 1000);
}
```

- 1. Open the model. Note the **Test Data** subsystem. The **RF Data Source** block in the subsystem represents the external memory and the FPGA core. The **RF Data Source** block has two output ports, **Stream Data** and **event**. They output the RF data and a notification when new data frame is available, respectively.
- **2.** Note that the **RF Data Source** block generates frames of 1000 samples every 0.01 s. The frames are samples of a 1 kHz sine waveform.

- **3.** Click the **Task Manager** block. Observe that it sets an event-driven task **dataReadTask**. The task is triggered by the arrival of a new data frame.
- **4.** Click the **Simulation** tab in the **Task Manager** dialog to define the task duration for simulation.

Since the algorithm consists of two filters executing without conditions, the application has a single code path. Therefore, you follow the first left branch in the flowchart shown in the introduction and you expect that the algorithm execution times have a normal distribution.

Based on the information given by the algorithm developer, you determine that the mean execution time is 0.0095 s and that the standard deviation is 0.0001 s. To represent the real-time limits, you also decide to set the min and the max execution times to 0.00925 s and 0.00975 s, respectively.

Set the duration parameters in the **Task Manager** dialog in the **Simulation** tab as described above.

- 5. In the model, click Run to start the simulation. Wait until the simulation completes.
- **6.** From the model toolbar, open the **Simulation Data Inspector** and inspect the **dataReadTask**. Zoom in to inspect the task execution times more closely.



**7.** Run the following command to perform the statistical analysis of the task execution times. Observe the **Simulation Data Inspector** run numbers. Modify the command if your run numbers are different.

socTaskTimes('soc\_task\_execution', 'Run 1: soc\_task\_execution\_simprofile')

# Task: dataReadTask

#### Histogram of the execution times



#### Statistics of the execution times

|         | Mean      | SD         | Min     | Max     |
|---------|-----------|------------|---------|---------|
| Run # 1 | 0.0095026 | 0.00010217 | 0.00925 | 0.00975 |

Observe that the task durations vary. As expected, the histogram of the task duration times indicates that the algorithm has one code path. The duration values are clustered around the mean value of 0.0095 s.

8. Close the model without making any changes.

# Simulating an Algorithm with Two Code Paths

In this case, you will learn how to model the task duration when the task algorithm has two code paths and it can be predicted which path will be taken.

Assume that you are developing a video surveillance application. The task is to constantly process video data to determine if there was intrusion in the system. The algorithm calculates the amount of scene change between consecutive video data frames. If the scene change exceeds the selected threshold, such frames are recorded as they may be used as evidence of potential intrusion. Thus, this algorithm has two code paths. The source code of this algorithm may be represented in the following form.

```
void VideoTask(single in[], in length, double threshold)
{
    double energy;
    energy = calcSceneChange(in, length);
    if (energy > threshold)
        recordFrame(in, length);
    }
}
```

## Task Execution Case 2



Copyright 2019 The MathWorks, Inc.

1. Open the model. Note the **Data Source** block that outputs the frames of video data.

- **2.** Click the Model block and observe that the algorithm calculates motion energy between consecutive frames of data. If the calculated motion energy exceeds the threshold, the **Main Algorithm** is executed.
- **3.** Click the **Task Manager** block. Observe that it sets a timer-driven task **VideoTask**. This task runs every 0.33333 s, which is the video frame rate.
- 4. Click the **Simulation** tab in **Task Manager** dialog to define the task duration for simulation.

Since the algorithm has two code paths and it can be predicted which code path will be taken, follow the second left branch in the flowchart.

Model task duration to depend on motion energy. Depending on whether the motion energy threshold is exceeded or not, you will assign the task duration with the mean of 75% or 50% of the frame rate, respectively.

Click the **Task Duration Estimation** subsystem to understand how to model task duration.

- 5. In the model, click Run to start the simulation. Wait until the simulation completes.
- **6.** From the model toolbar, open the **Simulation Data Inspector** and inspect **VideoTask**. Zoom in to inspect the task execution times more closely.



**7.** Run the following command to perform the statistical analysis of the task execution times. Observe the **Simulation Data Inspector** run numbers. Modify the command if your run numbers are different.

socTaskTimes('soc\_task\_execution\_step2', 'Run 3: soc\_task\_execution\_step2\_simprofile

### Histogram



#### Statistics

|     |     | Mean     | SD        | Min     | Max      |
|-----|-----|----------|-----------|---------|----------|
| Run | # 1 | 0.016538 | 0.0043039 | 0.01069 | 0.030948 |

Observe that the task durations vary. As expected, the histogram of the task duration times indicates that the algorithm has two code paths.

 $\boldsymbol{8.}$  Close the model without making any changes.

#### Simulating an Algorithm with Indeterminate Number of Code Paths

In this case, you will learn how to model the task duration when the task algorithm has an indeterminate number of code paths, but the code paths are repeatable for the given set of data.

In this case, assume that you are developing a complex application that processes data on an SoC board. Due to the complexity of the processing, the algorithm has an indeterminate number of code paths. As a result, it is not possible to predict which code path will be taken. However, it is known that the distribution of task durations is repeatable in multiple experiments. The source code for such an algorithm might have the following form.

```
int myTask(int arr[], int length)
{
    int i = 0;
    int sum = 0;
    while (i < length) {
        if (arr[i] > 0)
            sum = sum + arr[i]
        i++;
    }
}
```

## Task Execution Case 3



Copyright 2019 The MathWorks, Inc.

- 1. Open the model.
- **2.** The model is set for **Xilinx Zynq ZC706 evaluation kit board**. To use a different board, go to the model Configuration Parameters dialog and select one of the supported boards listed in the **Hardware Implementation** page. Do the same for the top model and the referenced model.
- **3.** Click the **Task Manager** block and select the task **myTask**. Click the **Simulation** tab. Observe that we define the probability distribution as a combination of two normal distributions.
- **4.** Click **Run** to start the simulation. The task execution data will be streamed to the **Simulation Data Inspector**.

**5.** Run the following command to perform the statistical analysis of the task execution times obtained in simulation. Observe the **Simulation Data Inspector** run number. Modify the command if your run number is different.

socTaskTimes('soc\_task\_execution\_step3', 'Run 4: soc\_task\_execution\_step3\_simprofile

### Histogram



#### **Statistics**

|         | Mean      | SD        | Min    | Max       |
|---------|-----------|-----------|--------|-----------|
| Run # 1 | 0.0045092 | 0.0006698 | 0.0035 | 0.0065432 |

Notice that the task duration distribution obtained in simulation matches the expected results.

**6.** Close this model without making any changes.

#### **Simulating an Application using Proxy Tasks**

Assume that you are developing a complex application that adds one more task to the model developed in the previous case. The implementation of this task is not currently available, but the timing specification for this task is known. The task executes every 0.02 s with the duration described by a normal distribution. The distribution has a mean of 0.008 s and standard deviation of 0.0009 s.

# Task Execution Case 4



Copyright 2019 The MathWorks, Inc.

1. Open the model.

- **2.** The model is set for **Xilinx Zynq ZC706 evaluation kit board**. To use a different board, go to the model Configuration Parameters dialog and select one of the supported boards listed in the **Hardware Implementation** page. Do the same for the top model and the referenced model.
- **3.** Click the **Task Manager** block and select the task **proxyTask**. Click the **Simulation** tab. Observe that we define the probability distribution as a normal distribution with the parameters mentioned in the introduction of this task.
- **4.** Inside the Model block open the Proxy Task block and inspect the sample time value. The sample time value must match the period value entered in the Task Manager block. Click Cancel.
- **5.** Click **Run** to start the simulation. The task execution data will be streamed to the **Simulation Data Inspector**.
- **6.** Run the following command to perform the statistical analysis of the task execution times obtained in simulation. Observe the **Simulation Data Inspector** run number. Modify the command if your run number is different.

socTaskTimes('soc\_task\_execution\_step4', 'Run 5: soc\_task\_execution\_step4\_simprofile



Notice that the task durations obtained in simulation for both application and the proxy tasks.

7. Close this model without making any changes.

## Compare the Simulation Results with the Results on Hardware

In this section, you will compare the timing results obtained in the previous simulation to the timing results obtained on your hardware board.

#### Required products:

- · Embedded Coder
- SoC Blockset Support Package for Xilinx Devices, or

• SoC Blockset Support Package for Intel Devices

# Task Execution Case 4



Copyright 2019 The MathWorks, Inc.

- **1.** In the model toolbar change the simulation mode to **External**. The model is already set to profile task execution as it runs on hardware.
- **2.** Click **Run**. After the code is generated and built, it will start executing on your hardware. The profiling data will be streamed to **Simulation Data Inspector** in real-time.
- **3.** Run the following command to perform the statistical analysis of the task execution times obtained on hardware. Observe the **Simulation Data Inspector** run number. Modify the command if your run number is different.

socTaskTimes('soc\_task\_execution\_step4', 'Run 6: soc\_task\_execution\_step4\_procprofile

Notice that the task durations obtained on hardware match the results obtained in simulation.

4. Close this model without making any changes.

#### **Summary**

This example showed you how to simulate task execution in a multitasking operating system, how to generate code and run it on a hardware board, and how to collect the real-time task execution data.

In this example, we used simple applications, each with one task. In a typical application, however, multiple tasks must be performed. Embedded applications must run each task per defined schedule. To allow for using the processor most efficiently and to react quickly to external events, a priority-based preemptive scheduling algorithm is used.

With priority-based preemptive scheduling, when a task gets preempted, a task switch occurs. The data used by the task (task context) is saved so that it can be restored when the task resumes executing. In this example, the task switching times are dwarfed by the task duration and are not simulated. In applications with much shorter task duration, you may need to consider them.

If a hardware board has multiple processor cores, embedded applications typically attempt to use all cores for the most efficient implementation. SoC Blockset uses a priority-based preemptive scheduling algorithm even when the processor has multiple cores. SoC Blockset honors assignment of tasks per core in both simulation and generated code.

Next, we recommend completing "Streaming Data from Hardware to Software" on page 5-39 example that illustrates a systematic approach to designing a complex SoC application using SoC Blockset.

## **Timer-Driven Task**

This example shows how to use the Task Manager block in a simple system where a timerdriven task samples and modifies data generated from a random number source.

#### Task Manager and Software Application Model

The following model simulates a software application running on an ARM processor. A Task Manager block schedules the execution of the Timer Driven Subsystem, inside the Software Application Model Reference block. A Random Number block simulates an data source that the timer driven task samples.



The following model shows the Software Application model. This model contains the Timer-Driven Subsystem that executes based on the Timer Task events from the Task Manager block in the top-level model.



The Timer-Driven Task Subsystem, a Subsystem, samples a data value every 0.5 seconds from the Random Number block and applies the Algorithm. In this model, the algorithm outputs the negative scalar value of the sampled data value. The following model shows

the Algorithm subsystem contained in the Timer-Driven Subsystem. The Inport block defines the 0.5 second sampling time for the Timer Driven Subsystem visible on the Software Application model when the Schedule rates parameter is enabled.



#### **Asynchronous Simulation and Results**

Click the Run button to build and run the model. When the model finishes running, open the SDI display to see the results of the simulation. Select the Timer\_Task, original data, and processed data signals to see the effect of asynchronous task execution.



As shown in the SDI display, the running time of the Timer\_Task varies at each instance. In some cases, the duration of the previous task execution delays the start of the next task

execution. Additionally, the processed data from the task outputs at a the same time as the completion of the task execution, resulting in observed delay in the processed Data compared to the original data. As a result, despite the specified time step of 0.5 seconds, the start of execution now behaves as if the subsystem were executed on an SoC device processor with the associated real world processing limitations.

## **Event-Driven Task**

This example shows how to use the task manager block to a simple system where data from UDP source gets processed asynchronously each time a data packet arrives. The task manager block

#### Task Manager and Software Application Model

The following model simulates a software application running on an ARM processor. A Task Manager block schedules the execution of the Asynchronous Subsystem, inside the Software Application Model Reference block. An IO Data Source block simulates the network transmission of UDP packets.



The Software Application contains the Asynchronous Task Subsystem, a Function-Call Subsystem, that executes each time an event trigger occurs. An Asynchronous Task Specification block specifies the priority of the UDP Task to match the priority set in the Task Manager block. A Rate Adaptor block allows sampling of the output signal of the Asynchronous Task Subsystem at the time step of the Simulink(c) model.



The Asynchronous Task Subsystem, a Function-Call Subsystem, reads a data value from a UDP Read block and applies the Algorithm each time a new UDP data value arrives. In this model, the algorithm outputs the negative scalar value received from the UDP Read block. The following model shows the UDP block and Algorithm subsystem contained in the function-call subsystem.





### **Asynchronous Simulation and Results**

Click the Run button to build and run the model. When the model finishes running, open the SDI display to see the results of the simulation. Select the ReadUDP, original data, and processed data signals to see the effect of asynchronous task execution.



As shown in the SDI display, the Running time of the ReadUDP varies at each instance of receiving a UDP data packet. In some cases, the previous task execution delays the start of the next task execution. While, in this example, the UDP packets arrive at a fixed rate relative to the Simulink sample time, the start of the task execution is not directly dependent on the sample time. The processed data from the task outputs at a the completion of the task execution, resulting in observed delay in the processed Data compared to the original data.

# Hardware-Software Partitioning of a Motor Control Algorithm

This example shows how to model a motor controller for SoC devices by partitioning the control and calibration algorithms between the FPGA and processor of the SoC.

#### Introduction

This example shows how to partition a Field-Oriented Controller (FOC) for a Permanent Magnet Synchronous Motor (PMSM) onto an SoC device. The following diagram shows a conceptual closed-loop FOC of PMSM.



In an FOC running in closed-loop, the current control loop needs to run at a high rate, typically microseconds. In contrast, the velocity control can run at lower rates, typically milliseconds, but must react to external events, such as commanded velocity updates. By partitioning the current and velocity controllers onto the FPGA and processor cores, respectively, both control loops in the FOC can meet the above requirements.

The first model in this example is used for behavioral simulation of an closed-loop FOC with an open-loop calibration controller for a PMSM. The second model shows how the

open-loop calibration controller, closed-loop velocity controller, and closed-loop current controller can be partitioned into an SoC device using SoC Blockset. A comparison of the simulation results between the behavioral and SoC models shows the expected behavior of the controller is maintained.

#### **Behavioral Model**

The top-level structure of the behavioral model is shown below. The Plant subsystem models a PMSM with load with simulated measurements from a motor shaft encoder and current sensors. The model parameters of the motor, load, and sensors are based on the AD-FMCMOTCON2-EBZ Evaluation Board from Analog Devices®. The Controller subsystem contains the closed-loop FOC and the open-loop calibration controllers.

## Field-Oriented Control of PMSM (Behavioral Model)



Copyright 2019 The MathWorks, Inc.

The Controller is split into two subsystems, an inner Current Control loop and outer Velocity and Calibration Control loop.

The Current Control subsystem takes command current value from the Calibration and Velocity Control subsystem. The current controller uses consecutive Clarke and Park transforms to convert the AC current and voltage waveform into DC signals. A Proportional-Integral (PI) controller uses the DC signals to drive PWM switching signals to the power MOSFETs driving the PMSM.

The Velocity Control subsystem takes external commands to set the mode of the controller as either calibrating or closed-loop velocity tracking. In the calibration mode, the Mode\_Scheduler spins the motor using an open-loop velocity controller to identify the zero index of the shaft encoder. Then the controller commands and holds a zero position to identify the encoder offset. After determining the encoder offset, the velocity controller is calibrated and can be switched into closed-loop velocity control. The closed-loop velocity control also uses a PI controller, similar to the current controller.

#### **Hardware-Software Partitioned SoC Model**

The structure of the partitioned SoC model is based on the partitioning scheme shown below. The fast current controller is running on FPGA and slow velocity controller on processor. FPGA and processor communicate via AXI inteface.



The original Controller subsystem from the behavioral model has been partitioned into the processor and FPGA models, which are connected with Register Channel blocks.

## Field-Oriented Control of PMSM (SoC Model)



#### Processor

The open-loop calibration and the closed-loop velocity controllers are now inside a Model block operate as a task driven by the Task Manager block. As part of the task iteration, the controller first reads from the AXI registers using Register Read blocks, iterates the control algorithm, and then writes the updated outputs to the AXI register using the Register Write blocks. The Task Manager executes the controller task at a rate of 1kHz with an average execution duration of 0.2ms.

#### AXI Interface

Register Channel block models the AXI communication between FPGA and Processor for register read and write operations. The corresponding AXI4-Lite driver blocks, Register Read/Register Write are used in Processor Model to represent AXI4-Lite interface.

#### FPGA

The closed-loop current controller is contained in the Model block representing the FPGA of the SoC device. Since the current controller exists in the FPGA, it can write and read directly from the AXI hardware registers. The FPGA uses a 40us clock.

#### **Comparison of Behavioral and SoC Model Simulations**

1. Open and run the behavioral model. Observe the controller and motor behavior from System Response scope.



- 2. Open and run the partitioned SoC model. Observe that the controller and motor behavior matches.
- 3. Click **Data Inspector** to open the Simulation Data Inspector (SDI). Signal data for the previous model runs was automatically captured and archived in the SDI.

4. Select the **rotorVelocity** from **Run 1: soc\_motor\_behavior** and the **rotorVelocity** from **Run 2: soc\_motor\_top** into each subplot to get the following plot. Both the behavioral and partitioned models demonstrate equivalent motor velocity tracking.



5. From Run 3: soc\_motor\_top\_simprofile, select and display the Velocity\_Control\_Task and Core: 0 signals into each subplot to get the following plot. From the plot, you can observe the task execution time of the velocity controller and the CPU utilization.



## Other Things to Try

You can use this model as a template to develop an SoC model specific for your motor control hardware, e.g. FMC motor driver board - Analog Devices AD-FMCMOTCON2-EBZ. It will generate SoC design (FPGA bitstream and executable software) using HDL coder and Embedded coder. Refer to custom board support documentation for supporting customized SoC board and I/O devices.

## **Supported Hardware**

## **SoC Blockset Supported Hardware**



With the SoC Blockset, together with Embedded Coder® or HDL Coder, you can generate reference designs for Xilinx and Intel FPGA devices and SoC platforms, including Zynq, UltraScale $^{\text{TM}}$ , and SoC FPGA. These reference designs can be used with Xilinx or Intel design tools.

SoC Blockset provides support packages that help to automate integration, execution, and verification of reference designs for the SoC devices in this table.

| Support Package                                  | Vendor | Earliest Release<br>Available | Last Release<br>Available |
|--------------------------------------------------|--------|-------------------------------|---------------------------|
| SoC Blockset Support Package for Xilinx Devices  | Xilinx | R2019a                        | Current                   |
| "SoC Blockset Support Package for Intel Devices" | Intel  | R2019a                        | Current                   |

For a complete list of hardware support packages, see Hardware Support.